# DataFrame Indexing

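- Selecting columns (COL) only
  - Basic indexing: `df[column]`, `df[[column, ...]]` (a slice inside `df[...]` selects rows, not columns)
  - `loc`: `df.loc[:, column]`, `df.loc[:, [column, ...]]`, `df.loc[:, column_slice]`
  - `iloc`: `df.iloc[:, position]`, `df.iloc[:, [position, ...]]`, `df.iloc[:, position_slice]`
- Selecting rows (ROW) only
  - Basic indexing: `df[row_slice]`
    - Note: `df[position]` does not select a row, because a single value is looked up as a column label. To get one row of one column, chain the lookups, e.g. `df['embark_town'][3]`.
  - `loc`: `df.loc[row_label]`, `df.loc[[row_label, ...]]`, `df.loc[row_slice]`
  - `iloc`: `df.iloc[position]`, `df.iloc[[position, ...]]`, `df.iloc[position_slice]`

- Selecting rows (ROW) and columns (COL) together (loc)
  - `df.loc[row, column]`
  - Rows and columns are given as labels; an integer works only when it is itself an index label, as with the default `RangeIndex`. Slices include the end label.
    - Rows: a single label, `[several labels]`, a label slice
    - Columns: a single column, `[several columns]`, a column-name slice

- Selecting rows (ROW) and columns (COL) together (iloc)
  - `df.iloc[row, column]`
  - Rows and columns are given as integer positions only. Slices exclude the end position.
    - Rows: a single position, `[several positions]`, a position slice
    - Columns: a single column position, `[several column positions]`, a column-position slice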
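
The difference between label slices and position slices can be sketched on a tiny frame. This is only an illustration: `toy` and its labels are hypothetical and are not part of the titanic data.

```python
import pandas as pd

# Hypothetical toy frame with string row labels.
toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]},
    index=["r0", "r1", "r2", "r3"],
)

# loc slices by label and includes the end label.
by_label = toy.loc["r1":"r2", "a":"b"]

# iloc slices by position and excludes the end position.
by_position = toy.iloc[1:3, 0:2]

# Both selections cover rows r1-r2 and columns a-b.
print(by_label.equals(by_position))
```

Writing the same selection both ways makes the off-by-one difference in the slice end easy to see.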

## Loading the data (seaborn => titanic dataset)
The examples use the titanic dataset provided through seaborn.

In [1]:
import seaborn as sns
import pandas as pd

In [2]:
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [3]:
df.shape

(891, 15)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [None]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [None]:
df.isnull().sum()

Unnamed: 0,0
survived,0
pclass,0
sex,0
age,177
sibsp,0
parch,0
fare,0
embarked,2
class,0
who,0


In [None]:
df.isnull().sum().sum()

869

---

## Basic indexing

### Column indexing
```python
df['column']                  # select a single column (returns a Series)
df[['column']]                # select a single column, returned as a DataFrame
df[['column1', 'column2']]    # select several columns with a list
```
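
The return types of the three forms can be checked on a small frame. The sketch below assumes a hypothetical `toy` frame rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the titanic data.
toy = pd.DataFrame({"survived": [0, 1, 1], "sex": ["male", "female", "female"]})

one_col = toy["survived"]            # Series, shape (3,)
one_col_frame = toy[["survived"]]    # DataFrame, shape (3, 1)
two_cols = toy[["survived", "sex"]]  # DataFrame, shape (3, 2)

print(type(one_col).__name__, one_col.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
print(type(two_cols).__name__, two_cols.shape)
```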

In [None]:
df['survived']

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: survived, Length: 891, dtype: int64

In [None]:
df[ ['survived'] ]

Unnamed: 0,survived
0,0
1,1
2,1
3,1
4,0
...,...
886,0
887,1
888,0
889,1


In [None]:
df[ ['survived', 'sex'] ]

Unnamed: 0,survived,sex
0,0,male
1,1,female
2,1,female
3,1,female
4,0,male
...,...,...
886,0,male
887,1,female
888,0,female
889,1,male


### Inspecting a single column

In [None]:
df['survived'].values

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,

In [None]:
df['survived'].shape

(891,)

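### Row indexing
- Keep this separate from column indexing
  - `df[1:3]`: basic row indexing always uses slice notation
  - `df[3:4]`: to get a single row, write it as a one-row slice

- Picking arbitrary rows by position is not supported, e.g. `df[1,3,5]` raises an error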
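
A short sketch of these rules, assuming a hypothetical `toy` frame whose column names are strings:

```python
import pandas as pd

# Hypothetical toy frame; the columns are named "x" and "y".
toy = pd.DataFrame({"x": [10, 20, 30, 40], "y": list("abcd")})

rows = toy[1:3]     # positional slice: rows 1 and 2
single = toy[3:4]   # one row, still a DataFrame

try:
    toy[3]          # looked up as a column label, which does not exist
    raised = False
except KeyError:
    raised = True

print(rows.shape, single.shape, raised)
```

The `KeyError` is why a single row has to be written as a slice, or fetched with `iloc`.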

In [None]:
# Single-row indexing
df[3:4]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False


In [None]:
# Indexing several consecutive rows
df[1:3]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True


In [None]:
# Get one column (COL) and one row (ROW)
df['embark_town'][3]

'Southampton'

In [None]:
# Get one column (COL) and several rows (ROW)
df['embark_town'][1:3]

Unnamed: 0,embark_town
1,Cherbourg
2,Southampton


### Boolean indexing
- Boolean indexing retrieves the rows that satisfy a given condition.
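
As a minimal sketch of how conditions combine, the following uses a hypothetical `toy` frame. Each comparison produces a boolean Series, and each condition needs its own parentheses when combined inline because `&` and `|` bind more tightly than `==` and `<`.

```python
import pandas as pd

# Hypothetical toy frame with one missing age.
toy = pd.DataFrame(
    {
        "sex": ["female", "male", "female", "female"],
        "age": [18.0, 30.0, None, 45.0],
        "class": ["First", "Third", "Second", "Third"],
    }
)

is_female = toy["sex"] == "female"
is_young = toy["age"] < 20            # a missing age compares as False

young_women = toy[is_female & is_young]               # both conditions
upper = toy[toy["class"].isin(["First", "Second"])]   # either of two values
not_female = toy[~is_female]                          # negation

print(young_women.index.tolist(), upper.index.tolist(), not_female.index.tolist())
```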

In [None]:
# Select only the rows where sex is 'female'
df[df['sex'] == 'female']

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
880,1,2,female,25.0,0,1,26.0000,S,Second,woman,False,,Southampton,yes,False
882,0,3,female,22.0,0,0,10.5167,S,Third,woman,False,,Southampton,no,True
885,0,3,female,39.0,0,5,29.1250,Q,Third,woman,False,,Queenstown,no,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


In [None]:
# Select only the rows where sex is 'female' and age is less than 20
# When every condition must hold, combine them with the and operator (&).
df[ (df['sex'] == 'female') & (df['age'] < 20) ]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
10,1,3,female,4.0,1,1,16.7000,S,Third,child,False,G,Southampton,yes,False
14,0,3,female,14.0,0,0,7.8542,S,Third,child,False,,Southampton,no,True
22,1,3,female,15.0,0,0,8.0292,Q,Third,child,False,,Queenstown,yes,True
24,0,3,female,8.0,3,1,21.0750,S,Third,child,False,,Southampton,no,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
852,0,3,female,9.0,1,1,15.2458,C,Third,child,False,,Cherbourg,no,False
853,1,1,female,16.0,0,1,39.4000,S,First,woman,False,D,Southampton,yes,False
855,1,3,female,18.0,0,1,9.3500,S,Third,woman,False,,Southampton,yes,False
875,1,3,female,15.0,0,0,7.2250,C,Third,child,False,,Cherbourg,yes,True


In [None]:
# Select only the rows where class is 'First' or 'Second'
# When it is enough for any one condition to hold, combine them with the or operator (|).
df[ (df['class'] == 'First') | (df['class'] == 'Second') ]
#df[ df['class'].isin(['First', 'Second']) ]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False
11,1,1,female,58.0,0,0,26.5500,S,First,woman,False,C,Southampton,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
880,1,2,female,25.0,0,1,26.0000,S,Second,woman,False,,Southampton,yes,False
883,0,2,male,28.0,0,0,10.5000,S,Second,man,True,,Southampton,no,True
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True


### Selecting rows with null values

In [None]:
# Get the rows where the age column is null
df[df['age'].isnull()]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
17,1,2,male,,0,0,13.0000,S,Second,man,True,,Southampton,yes,True
19,1,3,female,,0,0,7.2250,C,Third,woman,False,,Cherbourg,yes,True
26,0,3,male,,0,0,7.2250,C,Third,man,True,,Cherbourg,no,True
28,1,3,female,,0,0,7.8792,Q,Third,woman,False,,Queenstown,yes,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
859,0,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
863,0,3,female,,8,2,69.5500,S,Third,woman,False,,Southampton,no,False
868,0,3,male,,0,0,9.5000,S,Third,man,True,,Southampton,no,True
878,0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True


## iloc (index by integer position / end position excluded)
- df.iloc[row, column]
  - Rows and columns take integer positions only / slices exclude the end position
    - Rows
      - a single position
      - [several positions]
      - a position slice
    - Columns
      - a single column position
      - [several column positions]
      - a column-position slice
    ```python
    # row 0, column 0
    df.iloc[0,0]
    # rows 1 and 3, columns 0, 2 and 3
    df.iloc[ [1,3], [0,2,3] ]
    # rows 2-4, columns 1-2
    df.iloc[2:5, 1:3]
    # rows 1 and 3, columns 1-4
    df.iloc[ [1,3], 1:5]
    # row 3, columns 1 and 3
    df.iloc[3,[1,3] ]
    # all data
    df.iloc[:]
    df.iloc[:][:]
    df.iloc[:,:]
    ```
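
What `iloc` returns depends on whether each selector is a scalar, a list, or a slice. The sketch below illustrates this on a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy frame with three rows and three columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": ["x", "y", "z"]})

cell = toy.iloc[0, 0]           # scalar selectors -> a single value
row = toy.iloc[0]               # one scalar -> a Series
row_frame = toy.iloc[[0]]       # a list -> a DataFrame with one row
block = toy.iloc[0:2, [0, 2]]   # rows 0-1 (2 is excluded), columns 0 and 2

print(cell, type(row).__name__, row_frame.shape, block.shape)
```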

### Selecting a single row

In [None]:
# Get a single row (row 0)
df.iloc[0]

Unnamed: 0,0
survived,0
pclass,3
sex,male
age,22.0
sibsp,1
parch,0
fare,7.25
embarked,S
class,Third
who,man


In [None]:
# Get a single row (as a table)
# - Passing a list inside the indexer [] returns a table (DataFrame),
#   so wrapping even a single value in a list returns a table as well
df.iloc[ [0] ]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False


### Selecting several rows

In [None]:
# A common mistake: df.iloc[0,2,4]    => error
df.iloc[ [0, 2, 4] ]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
df.iloc[2:4]      # rows at positions 2 and 3

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False


### Selecting rows and columns together

In [None]:
df.iloc[0,0]

0

In [None]:
df.iloc[ [1,3], [0,2,3] ]

Unnamed: 0,survived,sex,age
1,1,female,38.0
3,1,female,35.0


In [None]:
df.iloc[2:5, 1:3]

Unnamed: 0,pclass,sex
2,3,female
3,1,female
4,3,male


In [None]:
df.iloc[ [1,3], 1:5]

Unnamed: 0,pclass,sex,age,sibsp
1,1,female,38.0,1
3,1,female,35.0,1


In [None]:
df.iloc[ 3, [1,3] ]

Unnamed: 0,3
pclass,1.0
age,35.0


## loc (index by label / end label included)
- df.loc[row, column]
  - Rows and columns take labels; an integer works only when it is itself an index label / slices include the end label
    - Rows
      - a single label
      - [several labels]
      - a label slice
    - Columns
      - a single column
      - [several columns]
      - a column-name slice
  ```python
  # rows '1_index' and '3_index'
  df.loc[ ['1_index', '3_index'] ]
  # columns 'sex' and 'age' of rows '3_index' and '5_index'
  df.loc[ ['3_index', '5_index'], ['sex', 'age'] ]
  # columns 'sex' through 'fare' of rows '3_index' through '5_index'
  df.loc['3_index':'5_index', 'sex':'fare']
  # columns 'sex' through 'fare' of rows '3_index' and '5_index'
  df.loc[ ['3_index','5_index'], 'sex':'fare']
  # columns 'sex' and 'fare' of row '5_index'
  df.loc['5_index', ['sex','fare'] ]
  # all data
  df.loc[:]
  df.loc[:][:]
  df.loc[:,:]
  ```
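
With an integer index it is easy to mix up labels and positions. The sketch below assumes a hypothetical `toy` frame whose integer labels are deliberately out of order, so the two lookups give different rows.

```python
import pandas as pd

# Hypothetical toy frame: the integer labels do not match the row order.
toy = pd.DataFrame({"v": ["first", "second", "third"]}, index=[2, 0, 1])

by_label = toy.loc[0, "v"]       # the row whose label is 0
by_position = toy.iloc[0, 0]     # the first row in order

print(by_label, by_position)
```

On a default `RangeIndex` the two happen to agree, which is why the distinction only shows up after sorting, filtering, or relabeling.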

### Selecting a single row

In [None]:
#df.drop(columns=['index'])
#df.reset_index(inplace=True)

In [None]:
# For the loc tests: change the row index to strings.
df.index = df.index.astype('str') + '_index'

In [None]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0_index,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1_index,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2_index,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3_index,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4_index,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
df.loc['1_index']

Unnamed: 0,1_index
survived,1
pclass,1
sex,female
age,38.0
sibsp,1
parch,0
fare,71.2833
embarked,C
class,First
who,woman


In [None]:
df.loc[['1_index']]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1_index,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False


### Selecting several rows

In [None]:
df.loc[ ['1_index', '3_index', '7_index'] ]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
1_index,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
3_index,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
7_index,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False


In [None]:
df.loc['3_index' : '7_index']

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
3_index,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4_index,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5_index,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6_index,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7_index,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False


### Selecting rows and columns together

In [None]:
df.loc[ ['3_index', '5_index'], ['sex', 'age'] ]

Unnamed: 0,sex,age
3_index,female,35.0
5_index,male,


In [None]:
df.loc['3_index':'5_index', 'sex':'fare']

Unnamed: 0,sex,age,sibsp,parch,fare
3_index,female,35.0,1,0,53.1
4_index,male,35.0,0,0,8.05
5_index,male,,0,0,8.4583


In [None]:
df.loc[ ['3_index','5_index'], 'sex':'fare']

Unnamed: 0,sex,age,sibsp,parch,fare
3_index,female,35.0,1,0,53.1
5_index,male,,0,0,8.4583


In [None]:
df.loc['5_index', ['sex','fare'] ]

Unnamed: 0,5_index
sex,male
fare,8.4583


In [None]:
df.loc[:, :]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0_index,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1_index,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2_index,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3_index,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4_index,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886_index,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887_index,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888_index,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889_index,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


---

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 891 entries, 0_index to 890_index
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 119.8+ KB


---
# Python Functions

In [None]:
import seaborn as sns
import pandas as pd

In [None]:
print(sns.get_dataset_names())

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']


In [None]:
tit = sns.load_dataset('titanic')
tit.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [None]:
tit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


### Basic Python functions

### Conditional lookup functions

In [None]:
# Conditional lookup functions
import seaborn as sns
import pandas as pd
tit = sns.load_dataset('titanic')
# print(tit.head())
# print(tit.info())

print('1 : \n', tit['age'].between(10, 20))                    # True where 10 <= age <= 20
print('2 : \n', tit['fare'].isin([10, 50]))                    # True where fare is exactly 10 or 50
print('3 : \n', tit['deck'].isnull())                          # True where deck is missing
print('4 : \n', tit['age'].apply(lambda x : x * 2))            # apply a function to every value
print('5 : \n', tit['class'].apply(lambda x : x[0] == 'T'))    # True where class starts with 'T'

1 : 
 0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887     True
888    False
889    False
890    False
Name: age, Length: 891, dtype: bool
2 : 
 0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888    False
889    False
890    False
Name: fare, Length: 891, dtype: bool
3 : 
 0       True
1      False
2       True
3      False
4       True
       ...  
886     True
887    False
888     True
889    False
890     True
Name: deck, Length: 891, dtype: bool
4 : 
 0      44.0
1      76.0
2      52.0
3      70.0
4      70.0
       ... 
886    54.0
887    38.0
888     NaN
889    52.0
890    64.0
Name: age, Length: 891, dtype: float64
5 : 
 0       True
1      False
2       True
3      False
4       True
       ...  
886    False
887    False
888     True
889    False
890     True
Name: class, Length: 891, dtype: bool
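
The boundary behavior of these lookups is easier to see on a handful of values. This is a sketch on a hypothetical Series `s`, and the `inclusive="neither"` option assumes pandas 1.3 or later.

```python
import pandas as pd

# Hypothetical values, including one missing entry.
s = pd.Series([5.0, 10.0, 15.0, 20.0, None])

inclusive = s.between(10, 20)                        # both ends included by default
exclusive = s.between(10, 20, inclusive="neither")   # strict inequalities
members = s.isin([5, 20])                            # exact matches only
missing = s.isnull()                                 # True only for the missing entry

print(inclusive.tolist())
print(exclusive.tolist())
print(members.tolist())
print(missing.tolist())
```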


### String functions

In [None]:
# String functions
# Form: <string column>.str.<string function>
import seaborn as sns
import pandas as pd
tit = sns.load_dataset('titanic')
# print(tit.head())
# print(tit.info())

print('upper : ', tit['embark_town'].str.upper())                         # convert to upper case
print('lower : ', tit['embark_town'].str.lower())                         # convert to lower case
print('capitalize : ', tit['embark_town'].str.capitalize())               # upper-case the first character only
print('slice : ', tit['embark_town'].str.slice(start = 0, stop = 2))      # cut the string (from start up to, but not including, stop)
print('len : ', tit['embark_town'].str.len())                             # length
print('strip : ', tit['embark_town'].str.strip())                         # remove whitespace on both sides
print('lstrip : ', tit['embark_town'].str.lstrip())                       # remove whitespace on the left
print('rstrip : ', tit['embark_town'].str.rstrip())                       # remove whitespace on the right
print('replace : ', tit['embark_town'].str.replace('t', 'D'))             # replace text (every matching occurrence)
print('find : ', tit['embark_town'].str.find('S'))                        # find text (position of the first match / -1 if absent)
print('contains : ', tit['embark_town'].str.contains('text to find'))     # True if the text is present, otherwise False

upper :  0      SOUTHAMPTON
1        CHERBOURG
2      SOUTHAMPTON
3      SOUTHAMPTON
4      SOUTHAMPTON
          ...     
886    SOUTHAMPTON
887    SOUTHAMPTON
888    SOUTHAMPTON
889      CHERBOURG
890     QUEENSTOWN
Name: embark_town, Length: 891, dtype: object
lower :  0      southampton
1        cherbourg
2      southampton
3      southampton
4      southampton
          ...     
886    southampton
887    southampton
888    southampton
889      cherbourg
890     queenstown
Name: embark_town, Length: 891, dtype: object
capitalize :  0      Southampton
1        Cherbourg
2      Southampton
3      Southampton
4      Southampton
          ...     
886    Southampton
887    Southampton
888    Southampton
889      Cherbourg
890     Queenstown
Name: embark_town, Length: 891, dtype: object
slice :  0      So
1      Ch
2      So
3      So
4      So
       ..
886    So
887    So
888    So
889    Ch
890    Qu
Name: embark_town, Length: 891, dtype: object
len :  0      11.0
1       9.0
2      
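
The `.str` methods pass missing values through, which matters for `embark_town` because it has two nulls. The sketch below assumes a hypothetical three-element Series and shows the `na` argument of `str.contains`, which makes the result usable as a boolean mask.

```python
import pandas as pd

# Hypothetical values with one missing entry.
towns = pd.Series(["Southampton", "Cherbourg", None])

has_south = towns.str.contains("South", na=False)   # treat missing as "no match"
lengths = towns.str.len()                           # the missing entry stays missing
first_two = towns.str.slice(0, 2)                   # characters 0 and 1

print(has_south.tolist())
print(lengths.tolist())
print(first_two.tolist())
```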

### Aggregate functions

In [None]:
# Aggregate functions
import seaborn as sns
import pandas as pd
tit = sns.load_dataset('titanic')
print(tit.head())
# print(tit.info())

print('max : ' , tit['fare'].max())         # maximum
print('min : ', tit['fare'].min())          # minimum
print('sum : ', tit['fare'].sum())          # sum
print('mean : ', tit['fare'].mean())        # mean
print('count : ', tit['fare'].count())      # number of non-null values
print('median : ', tit['fare'].median())    # median
print('mode : ', tit['fare'].mode())        # most frequent value(s)


   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  
max :  512.3292
min :  0.0
sum :  28693.9493
mean :  32.204207968574636
count :  891
median :  14.4542
mode :  0    8.05
Name: fare, dtype: float64
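
Several aggregates can also be computed in one call with `agg`. The following is a sketch on a hypothetical `fare` Series of six values, not the full column.

```python
import pandas as pd

# Hypothetical fares; 8.05 appears twice.
fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 8.05])

summary = fare.agg(["min", "max", "mean", "median", "count"])
most_common = fare.mode()   # a Series, because several values can tie

print(summary)
print(most_common.tolist())
```

`mode()` returns a Series rather than a scalar, which is why the output above prints an index next to `8.05`.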


### Sorting data

In [None]:
import seaborn as sns
import pandas as pd
tit = sns.load_dataset('titanic')
# print(tit.head())
# print(tit.info())

# Sort the data (oldest first; rows with a missing age come last)
tit.sort_values(by = 'age', ascending = False)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
630,1,1,male,80.0,0,0,30.0000,S,First,man,True,A,Southampton,yes,True
851,0,3,male,74.0,0,0,7.7750,S,Third,man,True,,Southampton,no,True
493,0,1,male,71.0,0,0,49.5042,C,First,man,True,,Cherbourg,no,True
96,0,1,male,71.0,0,0,34.6542,C,First,man,True,A,Cherbourg,no,True
116,0,3,male,70.5,0,0,7.7500,Q,Third,man,True,,Queenstown,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
859,0,3,male,,0,0,7.2292,C,Third,man,True,,Cherbourg,no,True
863,0,3,female,,8,2,69.5500,S,Third,woman,False,,Southampton,no,False
868,0,3,male,,0,0,9.5000,S,Third,man,True,,Southampton,no,True
878,0,3,male,,0,0,7.8958,S,Third,man,True,,Southampton,no,True
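
The placement of missing values and sorting on more than one key can be sketched as follows, assuming a hypothetical `toy` frame.

```python
import pandas as pd

# Hypothetical toy frame with one missing age.
toy = pd.DataFrame({"pclass": [3, 2, 1, 3], "age": [22.0, None, 38.0, 4.0]})

oldest_first = toy.sort_values(by="age", ascending=False)   # NaN goes last by default
nan_first = toy.sort_values(by="age", na_position="first")  # NaN first, then ascending
multi = toy.sort_values(by=["pclass", "age"], ascending=[True, False])

print(oldest_first.index.tolist())
print(nan_first.index.tolist())
print(multi.index.tolist())
```

`sort_values` returns a new frame and keeps the original row labels, so the index shows where each row came from.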


### Date handling

In [None]:
df1 = sns.load_dataset('dowjones')
df1.head()

# Date handling
df1['날짜'] = pd.to_datetime(df1['Date'])   # convert to datetime ('날짜' = date)

df1['년'] = df1['날짜'].dt.year         # '년' = year
df1['월'] = df1['날짜'].dt.month        # '월' = month
df1['일'] = df1['날짜'].dt.day          # '일' = day
df1['분기'] = df1['날짜'].dt.quarter    # '분기' = quarter

df1

Unnamed: 0,Date,Price,날짜,년,월,일,분기
0,1914-12-01,55.00,1914-12-01,1914,12,1,4
1,1915-01-01,56.55,1915-01-01,1915,1,1,1
2,1915-02-01,56.00,1915-02-01,1915,2,1,1
3,1915-03-01,58.30,1915-03-01,1915,3,1,1
4,1915-04-01,66.45,1915-04-01,1915,4,1,2
...,...,...,...,...,...,...,...
644,1968-08-01,883.72,1968-08-01,1968,8,1,3
645,1968-09-01,922.80,1968-09-01,1968,9,1,3
646,1968-10-01,955.47,1968-10-01,1968,10,1,4
647,1968-11-01,964.12,1968-11-01,1968,11,1,4
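
The same steps can be tried without downloading anything. This sketch assumes a hypothetical two-row frame `raw` with the dates stored as strings.

```python
import pandas as pd

# Hypothetical frame with the dates stored as text.
raw = pd.DataFrame({"Date": ["1914-12-01", "1915-04-01"], "Price": [55.00, 66.45]})

# The .dt accessor only works after the column is converted to datetime.
raw["Date"] = pd.to_datetime(raw["Date"])

parts = pd.DataFrame(
    {
        "year": raw["Date"].dt.year,
        "month": raw["Date"].dt.month,
        "day": raw["Date"].dt.day,
        "quarter": raw["Date"].dt.quarter,
    }
)
print(parts)
```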


### rank & qcut
- rank : assign ranks
- qcut : split into grades (quantile-based bins)
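
How ties are ranked depends on the `method` argument. The sketch below compares three methods on a hypothetical `scores` Series and then splits the dense ranks into two quantile bins.

```python
import pandas as pd

# Hypothetical scores with one tie (20 appears twice).
scores = pd.Series([10, 20, 20, 30, 40, 50])

dense = scores.rank(method="dense")   # ties share a rank and no rank is skipped
minimum = scores.rank(method="min")   # ties share the lowest rank of the group
average = scores.rank()               # default: ties get the mean of their ranks

# qcut puts roughly the same number of values into each bin.
grades = pd.qcut(dense, q=2, labels=["low", "high"])

print(dense.tolist())
print(minimum.tolist())
print(average.tolist())
print(grades.tolist())
```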

In [None]:
# rank function
df = sns.load_dataset('car_crashes')
print(df.head())
# print(df.info())

df1 = df.copy()

# Case 1: rank the alcohol values ('순위_alcohol' = alcohol rank)
df1['순위_alcohol'] = df1['alcohol'].rank(method = 'dense', ascending = True)
# df1.sort_values(by='순위_alcohol')
# Split into grades (5 quantile bins)
df1['alcohol_grade'] = pd.qcut(df1['순위_alcohol'], q = 5, labels = [1, 2, 3, 4, 5])
df1.sort_values(by = ['순위_alcohol', 'alcohol_grade']).head(10)


   total  speeding  alcohol  not_distracted  no_previous  ins_premium  \
0   18.8     7.332    5.640          18.048       15.040       784.55   
1   18.1     7.421    4.525          16.290       17.014      1053.48   
2   18.6     6.510    5.208          15.624       17.856       899.47   
3   22.4     4.032    5.824          21.056       21.280       827.34   
4   12.0     4.200    3.360          10.920       10.680       878.41   

   ins_losses abbrev  
0      145.08     AL  
1      133.93     AK  
2      110.35     AZ  
3      142.39     AR  
4      165.63     CA  


Unnamed: 0,total,speeding,alcohol,not_distracted,no_previous,ins_premium,ins_losses,abbrev,순위_alcohol,alcohol_grade
8,5.9,2.006,1.593,5.9,5.9,1273.89,136.05,DC,1.0,1
44,11.3,4.859,1.808,9.944,10.848,809.38,109.48,UT,2.0,1
23,9.6,2.208,2.784,8.448,8.448,777.18,133.35,MN,3.0,1
21,8.2,1.886,2.87,7.134,6.56,1011.14,135.63,MA,4.0,1
30,11.2,1.792,3.136,9.632,8.736,1301.52,159.85,NJ,5.0,1
37,12.8,4.224,3.328,8.576,11.52,804.71,104.61,OR,6.0,1
4,12.0,4.2,3.36,10.92,10.68,878.41,165.63,CA,7.0,1
46,12.7,2.413,3.429,11.049,11.176,768.95,153.72,VA,8.0,1
29,11.6,4.06,3.48,10.092,9.628,746.54,120.21,NH,9.0,1
47,10.6,4.452,3.498,8.692,9.116,890.03,111.62,WA,10.0,1


# Python Practice

___
## 1. Basic exploration (problems using the tips dataset)

In [None]:
import seaborn as sns
import pandas as pd

In [None]:
# titanic
# tips
# iris
print(sns.get_dataset_names())

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']


In [None]:
df = sns.load_dataset('tips')
# total_bill : total bill (USD)      3.07~50.81             float
# tip        : tip (USD)             1.0~10.0               float
# sex        : sex                   Male / Female          category (string labels)
# smoker     : smoker or not         Yes / No               category (string labels)
# day        : day of the week       Thur, Fri, Sat, Sun    category (string labels)
# time       : meal time             Lunch, Dinner          category (string labels)
# size       : party size            1~6                    int

df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [None]:
df.describe()

Unnamed: 0,total_bill,tip,size
count,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672
std,8.902412,1.383638,0.9511
min,3.07,1.0,1.0
25%,13.3475,2.0,2.0
50%,17.795,2.9,2.0
75%,24.1275,3.5625,3.0
max,50.81,10.0,6.0


In [None]:
# Problem 1 - Find the first quartile of the total_bill variable and print it as an integer.
result = df['total_bill'].quantile(0.25)
result = int(result)
print(result)

13


In [None]:
# Problem 2 - Count the rows whose total_bill is between 20 and 25, inclusive.
print(df[df['total_bill'].between(20, 25)].shape[0])
print(len(df[df['total_bill'].between(20, 25)]))

42
42


In [None]:
# Problem 3 - Find the IQR of the tip variable.
# (IQR = Q3 - Q1, the third quartile minus the first quartile)
q3, q1 = df['tip'].quantile([0.75, 0.25])
# print(q3, q1)
IQR = q3 - q1
print(IQR)

1.5625


In [None]:
# Problem 4 - Sum the 10 largest values of the tip variable, drop the fractional part, and print the integer.
result = df.sort_values(by='tip', ascending=False).iloc[0:10]
#print(result)
print(int(result['tip'].sum()))

70


In [None]:
# Problem 5 - Print the proportion of rows whose sex is Female, rounded to one decimal place.
sex_f = df[['sex']][df['sex'] == 'Female'].shape[0]
sex_all = df.shape[0]

print(sex_f, sex_all)
print(sex_f / sex_all)
print(round(sex_f / sex_all, 1))

87 244
0.35655737704918034
0.4
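
The same proportion can be obtained more directly. This is a sketch on a hypothetical `sex` Series, using two standard pandas idioms rather than the approach in the cell above.

```python
import pandas as pd

# Hypothetical values: 2 of 5 are "Female".
sex = pd.Series(["Female", "Male", "Male", "Female", "Male"])

share = sex.value_counts(normalize=True)   # fraction of rows per value
female_share = (sex == "Female").mean()    # the mean of a boolean mask is a proportion

print(round(share["Female"], 1), round(female_share, 1))
```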


In [None]:
# Problem 6 - Take the first 10 rows in order, then print the mean of the total_bill column rounded to an integer.
result = df.iloc[0:10]
# print(result)

print(round(result['total_bill'].mean()))

19


In [None]:
# Problem 7 - Take the first 50% of the rows in order and find the median of the tip variable.

cnt_50 = df.shape[0] / 2
print(df.shape[0], cnt_50)

df_50 = df.iloc[0:int(cnt_50)]
# print(df_50)

result = df_50['tip'].median()
print(result)

244 122.0


## 2. Handling missing values (mpg dataset)

In [None]:
df = sns.load_dataset('mpg')
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
 8   name          398 non-null    object 
dtypes: float64(4), int64(3), object(2)
memory usage: 28.1+ KB


In [None]:
# Problem 8 - Find the number of records that contain missing values.
df[df['horsepower'].isnull()]       # look up the rows where horsepower is missing
print(df.isnull().sum())            # number of missing values per column
print(df.isnull().sum().sum())      # total number of missing values

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
name            0
dtype: int64
6


In [None]:
# Problem 9 - Replace the missing values in the horsepower column with the mean of horsepower, then print the median of horsepower as an integer.
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())
#df.isnull().sum()

result = df['horsepower'].median()
print(int(result))

95
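
Filling and dropping give different row counts, which affects any later "first 50%" calculation. The sketch below contrasts the two on a hypothetical `hp` Series.

```python
import pandas as pd

# Hypothetical horsepower values with one missing entry.
hp = pd.Series([100.0, None, 200.0, 150.0])

filled = hp.fillna(hp.mean())   # mean() skips NaN: (100 + 200 + 150) / 3 = 150
dropped = hp.dropna()           # removes the missing entry instead

print(filled.tolist(), len(filled))
print(dropped.tolist(), len(dropped))
```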


In [None]:
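# Problem 10 - Remove the rows with a missing horsepower value, take the first 50% of the rows in order, and find the first quartile of horsepower in that subset.

df1 = df.copy()
# df1.isnull().sum()

# df already had its missing horsepower values filled in Problem 9,
# so nothing is dropped here; reload the dataset to start from the raw data.
df1 = df1.dropna(subset = ['horsepower'])
# df1.isnull().sum()

cnt_50 = df1.shape[0] / 2
# print(cnt_50)    # 199.0

result = df1.iloc[0:int(cnt_50)]
# print(result)

result2 = result['horsepower'].quantile(0.25)
print(result2)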

86.0


## 3. Handling outliers (mpg dataset)

In [None]:
# Problem 11 - Print the absolute difference, as an integer, between the mean mpg of cars with 3 cylinders and cars with 8 cylinders.
df.head()

c_3 = df['mpg'][df['cylinders'] == 3].mean()
c_8 = df['mpg'][df['cylinders'] == 8].mean()

print('c_3 : ', c_3, ', c_8 : ', c_8)

print(int(abs(c_3 - c_8)))

c_3 :  20.55 , c_8 :  14.963106796116506
5


In [None]:
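# Problem 12 - Standardize horsepower with z-scores and find the outliers. An outlier has a z-score above 1.5 or below -1.5.
#          z-score standardization : z = (x - mean) / standard deviation

horsepower_mean = df['horsepower'].mean()       # mean
horsepower_std = df['horsepower'].std()         # standard deviation (pandas default: sample, ddof=1)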

df['z_horsepower'] = (df['horsepower'] - horsepower_mean) / horsepower_std

cond1 = df['z_horsepower'] > 1.5
cond2 = df['z_horsepower'] < -1.5
df[cond1 | cond2]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name,z_horsepower
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320,1.572585
5,15.0,8,429.0,198.0,4341,10.0,70,usa,ford galaxie 500,2.429924
6,14.0,8,454.0,220.0,4354,9.0,70,usa,chevrolet impala,3.001484
7,14.0,8,440.0,215.0,4312,8.5,70,usa,plymouth fury iii,2.871584
8,14.0,8,455.0,225.0,4425,10.0,70,usa,pontiac catalina,3.131384
9,15.0,8,390.0,190.0,3850,8.5,70,usa,amc ambassador dpl,2.222085
10,15.0,8,383.0,170.0,3563,10.0,70,usa,dodge challenger se,1.702485
13,14.0,8,455.0,225.0,3086,10.0,70,usa,buick estate wagon (sw),3.131384
19,26.0,4,97.0,46.0,1835,20.5,70,europe,volkswagen 1131 deluxe sedan,-1.519034
25,10.0,8,360.0,215.0,4615,14.0,70,usa,ford f250,2.871584
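
As a minimal sketch, the z-score rule from problem 12 can be wrapped in a helper. The function name `zscore_outliers` and the `toy_hp` values are hypothetical; the 1.5 threshold follows the problem statement.

```python
import pandas as pd

def zscore_outliers(s: pd.Series, threshold: float = 1.5) -> pd.Series:
    """Return the values of s whose absolute z-score exceeds threshold."""
    z = (s - s.mean()) / s.std()      # Series.std() uses the sample formula (ddof=1)
    return s[z.abs() > threshold]     # |z| > t covers both z > t and z < -t

# Hypothetical values with one obviously extreme entry
toy_hp = pd.Series([90, 95, 100, 105, 110, 300])
outliers = zscore_outliers(toy_hp)
print(outliers)
```

Comparing `z.abs()` against the threshold is equivalent to combining the two one-sided conditions with `|`.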


In [None]:
# Problem 13 - Min-max normalize the mpg column, then count the records with a value greater than 0.8.
#          min-max scaling : (x - min) / (max - min)

mpg_min = df['mpg'].min()     # minimum
mpg_max = df['mpg'].max()     # maximum

df['min_max_mpg'] = (df['mpg'] - mpg_min) / (mpg_max - mpg_min)

cond1 = df['min_max_mpg'] > 0.8
print(df[cond1])
print(df[cond1].shape[0])

      mpg  cylinders  displacement  horsepower  weight  acceleration  \
244  43.1          4          90.0   48.000000    1985          21.5   
247  39.4          4          85.0   70.000000    2070          18.6   
309  41.5          4          98.0   76.000000    2144          14.7   
322  46.6          4          86.0   65.000000    2110          17.9   
324  40.8          4          85.0   65.000000    2110          19.2   
325  44.3          4          90.0   48.000000    2085          21.7   
326  43.4          4          90.0   48.000000    2335          23.7   
329  44.6          4          91.0   67.000000    1850          13.8   
330  40.9          4          85.0  104.469388    1835          17.3   
343  39.1          4          79.0   58.000000    1755          16.9   
394  44.0          4          97.0   52.000000    2130          24.6   

     model_year  origin                             name  min_max_mpg  
244          78  europe  volkswagen rabbit custom diesel     0.
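
The min-max formula from problem 13 maps the smallest value to 0 and the largest to 1. A small sketch on hypothetical values (`toy_mpg` and `min_max_scale` are illustrative names):

```python
import pandas as pd

def min_max_scale(s: pd.Series) -> pd.Series:
    """Rescale s linearly so that its minimum is 0 and its maximum is 1."""
    return (s - s.min()) / (s.max() - s.min())

# Hypothetical values, evenly spaced so the scaled result is easy to check by hand
toy_mpg = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
scaled = min_max_scale(toy_mpg)

# Count the records strictly above 0.8 by summing the boolean mask
n_high = int((scaled > 0.8).sum())
print(scaled.tolist(), n_high)
```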

In [None]:
# Problem 14 - Count the outliers in the weight column using the box-plot (IQR) rule.
#          IQR = q3 - q1
#          ub = q3 + (1.5 * IQR)  - upper outlier bound
#          lb = q1 - (1.5 * IQR)  - lower outlier bound

df1 = df.copy()

q3, q1 = df1['weight'].quantile([0.75, 0.25])

IQR = q3 - q1
ub = q3 + (1.5 * IQR)
lb = q1 - (1.5 * IQR)
print(q3, q1, IQR, ub, lb)
# q3  : 3608.0
# q1  : 2223.75
# IQR : 1384.25
# ub  : 5684.375
# lb  : 147.375

cond1 = df1['weight'] < lb
cond2 = df1['weight'] > ub

# A value is an outlier if it is below lb OR above ub (it can never be both)
df1.loc[cond1 | cond2, 'weight']

3608.0 2223.75 1384.25 5684.375 147.375


Unnamed: 0,weight


In [None]:
print(df1['weight'].min())      # 1613
print(df1['weight'].max())      # 5140

1613
5140
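
Since the minimum (1613) is above `lb` and the maximum (5140) is below `ub`, the weight column has no IQR outliers. To see the rule flag something, here is a sketch on hypothetical values; `iqr_outlier_mask` and `toy_weight` are illustrative names.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where s lies outside [q1 - k*IQR, q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lb, ub = q1 - k * iqr, q3 + k * iqr
    return (s < lb) | (s > ub)        # OR, not AND: no value is on both sides

# Hypothetical weights with one extreme value
toy_weight = pd.Series([1600, 2200, 2800, 3000, 3600, 9000])
mask = iqr_outlier_mask(toy_weight)
print(int(mask.sum()), toy_weight[mask].tolist())
```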


## 4. Data analysis (mpg data)

In [None]:
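# Problem 15 - Among cars with model_year greater than 80, find the car(s) with the highest mpg and print the sum of their horsepower.

result = df[df['model_year'] > 80].sort_values(by='mpg', ascending = False)
print(result.head(10))

# Keep the float value: truncating with int() would break the equality test below for a non-integer maximum
max_mpg = result.iloc[0]['mpg']
print('max_mpg : ', max_mpg)

print(result['horsepower'][result['mpg'] == max_mpg].sum())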


      mpg  cylinders  displacement  horsepower  weight  acceleration  \
394  44.0          4          97.0        52.0    2130          24.6   
343  39.1          4          79.0        58.0    1755          16.9   
344  39.0          4          86.0        64.0    1875          16.4   
378  38.0          4         105.0        63.0    2125          14.7   
387  38.0          6         262.0        85.0    3015          17.0   
385  38.0          4          91.0        67.0    1995          16.2   
383  38.0          4          91.0        67.0    1965          15.0   
348  37.7          4          89.0        62.0    2050          17.3   
347  37.0          4          85.0        65.0    1975          19.4   
376  37.0          4          91.0        68.0    2025          18.2   

     model_year  origin                               name  min_max_mpg  
394          82  europe                          vw pickup     0.930851  
343          81   japan                     toyota starlet 

In [None]:
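# Problem 16 - Print the absolute difference between the horsepower standard deviations of the usa and europe rows, to one decimal place.
import numpy as np
# df.info()
# df.describe()
df.head(5)
# df['origin'].unique()

# np.std uses ddof=0 (population standard deviation); Series.std() defaults to ddof=1 (sample)
usa = np.std( df['horsepower'][df['origin'] == 'usa'])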
europe = np.std(df['horsepower'][df['origin'] == 'europe'])
print('usa : ', usa, ', europe : ', europe)

result = round(abs(usa - europe), 1)
print(result)

usa :  39.53769039467097 , europe :  20.119473663887472
19.4
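
The choice between `np.std` and `Series.std()` changes the result, because their default `ddof` differs. A short sketch on hypothetical values makes the difference visible:

```python
import numpy as np
import pandas as pd

# Hypothetical values whose population standard deviation is exactly 2.0
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_std = s.std(ddof=0)          # population formula: divide by n
sample_std = s.std()             # pandas default: divide by n - 1
np_std = np.std(s.to_numpy())    # NumPy default matches the population formula

print(pop_std, sample_std, np_std)
```

Which one a problem expects depends on its wording; when it is not stated, it is worth printing both.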


In [None]:
# Problem 17 - Group by origin, compute the mean weight, and find the third quartile (Q3) of weight for the group with the highest mean.
# df.head(3)
# df.isnull().sum()
# df.dropna(inplace = True)
# df.isnull().sum()
result = df.groupby('origin')['weight'].mean().reset_index()
print(result)

# Pick the group with the highest mean weight instead of hardcoding its name
top_origin = result.sort_values(by = 'weight', ascending = False).iloc[0]['origin']    # usa

result = df['weight'][df['origin'] == top_origin].quantile(0.75)
print(result)

   origin       weight
0  europe  2423.300000
1   japan  2221.227848
2     usa  3361.931727


In [None]:
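# Problem 18 - Group by name, compute the median mpg, and find the first quartile (Q1) of mpg for the group with the highest median.
# df.info()
# df.isnull().sum()

result = df.groupby('name')['mpg'].median().reset_index()
#print(result)

result2 = result.sort_values(by = 'mpg', ascending = False)
print(result2.iloc[0])

# Use the top group's name from the sorted result rather than a hardcoded string
top_name = result2.iloc[0]['name']    # mazda glc
print(df['mpg'][df['name'] == top_name].quantile(0.25))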

name    mazda glc
mpg          46.6
Name: 176, dtype: object
46.6
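
Problems 17 and 18 share one pattern: aggregate per group, pick the group with the largest aggregate, then compute a quantile within that group. A sketch with `idxmax()`, which avoids sorting and hardcoded group names; the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical data with three groups
toy = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan", "europe"],
    "weight": [3000, 3500, 4000, 2000, 2200, 2400],
})

# After groupby, the group key becomes the index, so idxmax() returns the group label
means = toy.groupby("origin")["weight"].mean()
top_group = means.idxmax()

# Quantile within the selected group only
q3 = toy.loc[toy["origin"] == top_group, "weight"].quantile(0.75)
print(top_group, q3)
```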


In [None]:
# Problem 19 - Replace the top 10 horsepower values with the 10th-highest value, then select the rows with horsepower of 150 or more and print their mean horsepower rounded to an integer.
# df.head(3)

df1 = df.copy()
df1 = df1.sort_values(by='horsepower', ascending = False, ignore_index = True)
# print(df1.head(12))

a = df1.loc[9, 'horsepower']
print(a)

df1.loc[0:9, 'horsepower'] = a
# print(df1.head(12))

result = df1['horsepower'][df1['horsepower'] >= 150].mean()
print(round(result))

208.0
171


In [None]:
# Problem 20 - Print the mean mpg of cars whose name contains 'mercury', as an integer.
# df1.head(5)
df1 = df.copy()

cond1 = df1['name'].str.contains('mercury')
print(df1[['mpg', 'name']][cond1])

result = df1['mpg'][cond1].mean()
print(int(result))

      mpg                      name
49   23.0        mercury capri 2000
67   11.0           mercury marquis
90   12.0  mercury marquis brougham
113  21.0          mercury capri v6
154  15.0           mercury monarch
224  15.0   mercury cougar brougham
251  20.2      mercury monarch ghia
259  20.8            mercury zephyr
281  19.8          mercury zephyr 6
287  16.5     mercury grand marquis
379  36.0            mercury lynx l
19


In [None]:
# Problem 21 - Compute the correlation coefficient between horsepower and weight, to two decimal places.
df1 = df.copy()

result = df1['horsepower'].corr(df1['weight'])
print(round(result, 2))

0.86


In [None]:
# Problem 22 - Find the numeric variable most highly correlated with mpg.
df1 = df.copy()
# df1.info()
df1 = df1[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']]

corr_matrix = df1.corr()
# print(corr_matrix)

# Case1: after sorting in descending order, index[0] is mpg itself (correlation 1.0)
result = corr_matrix['mpg'].sort_values(ascending = False)
print(result.index[1])      # model_year

# Case2
# result = corr_matrix['mpg'].drop(['mpg'])
# print(result)
# print(result.idxmax())      # model_year

model_year


In [None]:
# Problem 23 - Find the numeric variable with the strongest negative correlation with mpg, then compute that variable's median.
df1 = df.copy()
df1 = df1[['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']]
corr_matrix = df1.corr()
print(corr_matrix)
corr_matrix = corr_matrix['mpg'].drop('mpg')
# corr_matrix = corr_matrix.sort_values()
print(corr_matrix)

# The strongest negative correlation is the minimum of the series
a = corr_matrix.idxmin()    # weight

result = df1[a].median()
print(result)

                   mpg  cylinders  displacement  horsepower    weight  \
mpg           1.000000  -0.775396     -0.804203   -0.778427 -0.831741   
cylinders    -0.775396   1.000000      0.950721    0.842983  0.896017   
displacement -0.804203   0.950721      1.000000    0.897257  0.932824   
horsepower   -0.778427   0.842983      0.897257    1.000000  0.864538   
weight       -0.831741   0.896017      0.932824    0.864538  1.000000   
acceleration  0.420289  -0.505419     -0.543684   -0.689196 -0.417457   
model_year    0.579267  -0.348746     -0.370164   -0.416361 -0.306564   

              acceleration  model_year  
mpg               0.420289    0.579267  
cylinders        -0.505419   -0.348746  
displacement     -0.543684   -0.370164  
horsepower       -0.689196   -0.416361  
weight           -0.417457   -0.306564  
acceleration      1.000000    0.288137  
model_year        0.288137    1.000000  
cylinders      -0.775396
displacement   -0.804203
horsepower     -0.778427
weight      
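
Problems 22 and 23 both reduce to reading one column of the correlation matrix. A sketch on hypothetical data, where `weight` is built to be perfectly negatively correlated with `mpg`:

```python
import pandas as pd

# Hypothetical data: weight falls on a straight line against mpg, year only loosely follows it
toy = pd.DataFrame({
    "mpg":    [30.0, 25.0, 20.0, 15.0, 10.0],
    "weight": [2000, 2500, 3000, 3500, 4000],
    "year":   [74, 72, 73, 70, 71],
})

# Correlation of every other column with mpg (drop mpg's self-correlation of 1.0)
corr_with_mpg = toy.corr()["mpg"].drop("mpg")

most_positive = corr_with_mpg.idxmax()         # largest positive coefficient
most_negative = corr_with_mpg.idxmin()         # most negative coefficient
strongest = corr_with_mpg.abs().idxmax()       # largest magnitude, ignoring sign

print(most_positive, most_negative, strongest, toy[most_negative].median())
```

Whether "most correlated" means the largest positive coefficient or the largest magnitude is a matter of how the question is read, so the sketch computes both.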

## 5. Date handling

In [None]:
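import pandas as pd
# Columns: 날짜 = date, 제품 = product, 판매량 = sales volume, 개당수익 = profit per unit
# Products: 사과 = apple, 딸기 = strawberry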
df = pd.DataFrame({
    '날짜':['20220102', '20220105', None, '20230127', '20220203', '20220205', '20220915', '20230301', '20230203', '20230205', '20230315', '20230515'],
    '제품' : ['사과', '딸기', None, '딸기', '사과', None, '사과', '딸기', '사과', '딸기', '사과', '사과' ],
    '판매량' : [3, None, 5, 10, 10, 10, 15, 15, 20, None, 30, 40],
    '개당수익' : [300, 400, 500, 600, 400, 500, 500, 600, 600, 700, 600, 600]
})

In [None]:
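# Problem 24 - Fill the missing values in the 판매량 column with its minimum, then print the absolute difference between the mean before and after filling, as an integer.
df1 = df.copy()
df1['판매량'] = df1['판매량'].fillna(df1['판매량'].min())
# print(df1['판매량'])

a = df['판매량'].mean()     # with missing values (mean() skips NaN)
b = df1['판매량'].mean()    # missing values replaced by the minimum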

print(int(abs(a - b)))

2


In [None]:
# Problem 25 - Compute the 사과 (apple) revenue for the first quarter of 2022. (매출액, revenue = 판매량 * 개당수익)
df1 = df.copy()

df1['매출액'] = df1['판매량'] * df1['개당수익']
# print(df1.head(10))

df1['날짜'] = pd.to_datetime(df1['날짜'])
# df1.info()

# df1['날짜'].dt.year / month / day
df1['년'] = df1['날짜'].dt.year
df1['월'] = df1['날짜'].dt.month
df1['일'] = df1['날짜'].dt.day
df1['분기'] = df1['날짜'].dt.quarter

cond1 = df1['년'] == 2022
cond2 = df1['분기'] == 1
cond3 = df1['제품'] == '사과'

print(df1['매출액'][cond1 & cond2 & cond3].sum())


4900.0
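
The date filtering in problem 25 can also be written without helper columns by using the `.dt` accessors directly in the mask. The sketch below uses hypothetical English column names (`date`, `revenue`) and passes an explicit `format`, which is an assumption about the `YYYYMMDD` strings.

```python
import pandas as pd

# Hypothetical data in the same YYYYMMDD string format, including a missing date
toy = pd.DataFrame({
    "date": ["20220102", "20220203", None, "20220915", "20230127"],
    "revenue": [900, 4000, 2500, 7500, 6000],
})

# Missing dates become NaT and never match a year or quarter comparison
toy["date"] = pd.to_datetime(toy["date"], format="%Y%m%d")

is_2022_q1 = (toy["date"].dt.year == 2022) & (toy["date"].dt.quarter == 1)
q1_revenue = toy.loc[is_2022_q1, "revenue"].sum()
print(q1_revenue)
```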


In [None]:
# Problem 26 - Compute the absolute difference between the total revenue of 2022 and 2023.
df1 = df.copy()

df1['매출액'] = df1['판매량'] * df1['개당수익']
df1['날짜'] = pd.to_datetime(df1['날짜'])

df1['년'] = df1['날짜'].dt.year

sal_22 = df1['매출액'][df1['년'] == 2022].sum()
sal_23 = df1['매출액'][df1['년'] == 2023].sum()

print(abs(sal_22 - sal_23))

51600.0


In [None]:
# Problem 27 - Find the product with the largest total revenue in 2023 and compute its 2023 sales volume.
df1 = df.copy()

df1['날짜'] = pd.to_datetime(df1['날짜'])
df1['매출액'] = df1['판매량'] * df1['개당수익']
df1['년'] = df1['날짜'].dt.year

cond2 = df1['년'] == 2023

# After groupby, '제품' becomes the index, so idxmax() returns the product name.
# Filter on 2023 (cond2), since the question asks about 2023 revenue.
pname = df1[cond2].groupby('제품')['매출액'].sum().idxmax()
print(pname)
cond3 = df1['제품'] == pname

result = df1['판매량'][cond2 & cond3].sum()
print(result)

사과
90.0


In [None]:
# Problem 28 - Print the number of rows with revenue greater than 4,000 won and less than 10,000 won.
df1 = df.copy()

df1['매출액'] = df1['판매량'] * df1['개당수익']

cond1 = df1['매출액'] > 4000
cond2 = df1['매출액'] < 10000

print(df1[cond1 & cond2].shape[0])

4


In [None]:
df1 = df.copy()

# Add arbitrary timestamp data to the DataFrame
time = pd.date_range('2023-07-24 12:00:00', '2023-07-25 14:50:30', periods = 12)
# print(time)

df1['time'] = time

df1 = df1.drop('날짜', axis = 1)
df1

Unnamed: 0,제품,판매량,개당수익,time
0,사과,3.0,300,2023-07-24 12:00:00.000000000
1,딸기,,400,2023-07-24 14:26:24.545454545
2,,5.0,500,2023-07-24 16:52:49.090909090
3,딸기,10.0,600,2023-07-24 19:19:13.636363636
4,사과,10.0,400,2023-07-24 21:45:38.181818181
5,,10.0,500,2023-07-25 00:12:02.727272727
6,사과,15.0,500,2023-07-25 02:38:27.272727272
7,딸기,15.0,600,2023-07-25 05:04:51.818181818
8,사과,20.0,600,2023-07-25 07:31:16.363636363
9,딸기,,700,2023-07-25 09:57:40.909090909


In [None]:
# Problem 29 - Compute the total sales volume of all products between 15:00 and 23:00 on July 24, 2023.

cond1 = df1['time'].between('2023-07-24 15:00', '2023-07-24 23:00')

result = df1['판매량'][cond1].sum()
print(result)

25.0
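
`Series.between` includes both endpoints by default, which matters when a timestamp falls exactly on a boundary. A sketch on hypothetical timestamps (`toy`, `qty`, `start`, and `end` are illustrative names):

```python
import pandas as pd

# Hypothetical rows, two of which sit exactly on the window boundaries
toy = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-07-24 12:00", "2023-07-24 15:00", "2023-07-24 19:30",
        "2023-07-24 23:00", "2023-07-25 01:00",
    ]),
    "qty": [1, 2, 3, 4, 5],
})

start, end = pd.Timestamp("2023-07-24 15:00"), pd.Timestamp("2023-07-24 23:00")

# Default: both endpoints included, so 15:00 and 23:00 both count
closed = toy.loc[toy["time"].between(start, end), "qty"].sum()

# inclusive="left" keeps 15:00 but excludes 23:00
half_open = toy.loc[toy["time"].between(start, end, inclusive="left"), "qty"].sum()

print(closed, half_open)
```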


In [None]:
# Problem 30 - Compute the total 딸기 (strawberry) revenue from 12:00 to 21:00 on July 24.
df1['매출액'] = df1['판매량'] * df1['개당수익']
cond1 = df1['time'].between('2023-07-24 12:00', '2023-07-24 21:00')
cond2 = df1['제품'] == '딸기'

result = df1['매출액'][cond1 & cond2].sum()
print(result)

6000.0


In [None]:
df1 = df.copy()
# df1['제품'] = df1['제품'].dropna(inplace=True)
df1 = df1.dropna()
# print(df1)

cond1 = df1['제품'].str.contains('딸')

result = df1[cond1]
print(result)

         날짜  제품   판매량  개당수익
3  20230127  딸기  10.0   600
7  20230301  딸기  15.0   600
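
The `dropna()` above removes every row with any missing value, including rows whose product name is present. If the goal is only to make `str.contains` safe on a column with missing names, one alternative is the `na=False` argument. A sketch on hypothetical data with English product names:

```python
import pandas as pd

# Hypothetical data with one missing product name
toy = pd.DataFrame({
    "product": ["apple", "strawberry", None, "strawberry jam"],
    "qty": [3, 4, 5, 6],
})

# na=False treats a missing name as "does not contain", so the mask is purely boolean
mask = toy["product"].str.contains("straw", na=False)
result = toy[mask]
print(result)
```

This keeps rows whose other columns contain missing values, which may or may not be what the analysis needs.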
