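### COVID-19 data
* COVID-19 data processing
    * Period: 2020-01-20 ~ 2021-08-31
* DataFrames created
    * Full data                                       -> corona
    * Weekends removed (Saturdays and Sundays dropped) -> corona_kospi
* One CSV file written per DataFrame
    * Full data         -> ./CSV/01_corona.csv                  -> used for the COVID-19 vs. news correlation analysis
    * Weekends removed  -> ./CSV/02_corona_without_weekend.csv  -> used for the COVID-19 vs. stocks and coins correlation analysis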

In [50]:
import pandas as pd
from datetime import datetime

In [51]:
xlsx = pd.read_excel("./EXCEL/코로나바이러스감염증-19_확진환자_발생현황_210928.xlsx")
xlsx.head(6)

,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,,,,,
1,,,,,
2,,,,,
3,일자,계(명),국내발생(명),해외유입(명),사망(명)
4,누적(명),305842,291457,14385,2463.5
5,2020-01-20 00:00:00,1,-,1,-


In [52]:
xlsx.shape

(623, 5)

In [53]:
xlsx.isnull().sum()

Unnamed: 0    3
Unnamed: 1    3
Unnamed: 2    3
Unnamed: 3    3
Unnamed: 4    3
dtype: int64

In [54]:
xlsx.loc[3]

Unnamed: 0         일자
Unnamed: 1       계(명)
Unnamed: 2    국내발생(명)
Unnamed: 3    해외유입(명)
Unnamed: 4      사망(명)
Name: 3, dtype: object

In [55]:
# Row 3 holds the sheet's real header, so use its values as the column names
columns = xlsx.loc[3].tolist()
for name in columns:
    print(name)

일자
계(명)
국내발생(명)
해외유입(명)
사망(명)


In [59]:
xlsx.columns = columns
xlsx.head(6)

,일자,계(명),국내발생(명),해외유입(명),사망(명)
0,,,,,
1,,,,,
2,,,,,
3,일자,계(명),국내발생(명),해외유입(명),사망(명)
4,누적(명),305842,291457,14385,2463.5
5,2020-01-20 00:00:00,1,-,1,-


In [60]:
# Drop the first 5 rows: 3 empty rows, the header row, and the cumulative-total row
xlsx_tmp = xlsx.drop([0, 1, 2, 3, 4])

In [61]:
xlsx_tmp.reset_index(drop = True, inplace = True)
xlsx_tmp.head(3)

,일자,계(명),국내발생(명),해외유입(명),사망(명)
0,2020-01-20 00:00:00,1,-,1,-
1,2020-01-21 00:00:00,0,-,-,-
2,2020-01-22 00:00:00,0,-,-,-
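
The header clean-up above (promote a data row to column names, then drop the rows that precede the real data) can be sketched on a tiny stand-in frame. The frame `raw` and the name `header_row` below are hypothetical and only mimic the layout of the real sheet; this is an illustration under those assumptions, not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical miniature of the raw sheet: a blank row, a header row,
# a cumulative-total row, then the daily data.
raw = pd.DataFrame(
    [
        [None, None],
        ["일자", "계(명)"],
        ["누적(명)", 3],
        ["2020-01-20", 1],
        ["2020-01-21", 2],
    ],
    columns=["Unnamed: 0", "Unnamed: 1"],
)

header_row = 1  # position of the row that holds the real column names

tidy = raw.iloc[header_row + 2:].copy()      # skip the header row and the totals row
tidy.columns = raw.loc[header_row].tolist()  # promote the header row to column labels
tidy = tidy.reset_index(drop=True)           # renumber rows from 0

print(tidy)
```

Slicing by position keeps the step independent of how many blank rows precede the header, as long as `header_row` is set correctly.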


In [62]:
xlsx_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 618 entries, 0 to 617
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   일자       618 non-null    object
 1   계(명)     618 non-null    object
 2   국내발생(명)  618 non-null    object
 3   해외유입(명)  618 non-null    object
 4   사망(명)    618 non-null    object
dtypes: object(5)
memory usage: 24.3+ KB


### Convert the confirmed-case count column to an integer type

In [63]:
corona = xlsx_tmp[["일자", "계(명)"]].copy()  # explicit copy, so the assignment below raises no SettingWithCopyWarning
corona["계(명)"] = corona["계(명)"].astype("int")
corona.head(3)


,일자,계(명)
0,2020-01-20 00:00:00,1
1,2020-01-21 00:00:00,0
2,2020-01-22 00:00:00,0
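
Assigning into a column subset taken from another frame is the pattern that typically triggers pandas' `SettingWithCopyWarning`. A minimal sketch of the explicit-copy approach follows; the frame `source` is hypothetical sample data, not the real sheet.

```python
import pandas as pd

# Hypothetical sample with string-typed counts, similar to the raw sheet.
source = pd.DataFrame(
    {
        "일자": ["2020-01-20", "2020-01-21"],
        "계(명)": ["1", "0"],
        "사망(명)": ["-", "-"],
    }
)

# .copy() makes the subset an independent frame, so later assignments
# are unambiguous and leave `source` untouched.
subset = source[["일자", "계(명)"]].copy()
subset["계(명)"] = subset["계(명)"].astype("int64")

print(subset.dtypes)
```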


In [64]:
corona.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 618 entries, 0 to 617
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   일자      618 non-null    object
 1   계(명)    618 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 9.8+ KB
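
The total column converts cleanly with `astype("int")`, but the other count columns in the sheet contain `-` placeholders, which `astype` would reject. If those columns were needed, one possible approach is `pd.to_numeric` with `errors="coerce"`. The sample values below are hypothetical, and treating `-` as 0 is an assumption that may not match what the data source intends.

```python
import pandas as pd

# Hypothetical sample mimicking a count column with "-" placeholders.
counts = pd.Series(["1", "-", 3, "-"], name="국내발생(명)")

# Unparseable entries such as "-" become NaN instead of raising an error.
numeric = pd.to_numeric(counts, errors="coerce")

# Assumption: a "-" means "no cases reported", so fill it with 0.
filled = numeric.fillna(0).astype("int64")

print(filled.tolist())
```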


### Convert the date column to a datetime type

In [65]:
corona = corona.copy()  # work on an explicit copy so the assignment below raises no SettingWithCopyWarning
corona["일자"] = pd.to_datetime(corona["일자"])  # `unit` only applies to numeric epoch values, so it is omitted
# Equivalent for string input: corona["일자"].apply(lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))


In [67]:
corona.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 618 entries, 0 to 617
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   일자      618 non-null    datetime64[ns]
 1   계(명)    618 non-null    int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 9.8 KB
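
For reference, here is a small sketch of the two common `pd.to_datetime` inputs: formatted strings, where `format` is the relevant argument, and numeric epoch values, where `unit` is. The sample values are illustrative and not taken from the sheet.

```python
import pandas as pd

# String input: an explicit format documents the expected layout.
strings = pd.Series(["2020-01-20 00:00:00", "2020-01-21 00:00:00"])
parsed = pd.to_datetime(strings, format="%Y-%m-%d %H:%M:%S")

# Numeric input: `unit` states what the numbers count (here, seconds since the Unix epoch).
epoch_seconds = pd.Series([1579478400])  # 2020-01-20 00:00:00 UTC
from_epoch = pd.to_datetime(epoch_seconds, unit="s")

print(parsed.iloc[0], from_epoch.iloc[0])
```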


### Keep only rows dated before 2021-09-01
* Data from 2020-01-20 through 2021-08-31

In [69]:
corona = corona.loc[corona["일자"] < '2021-09-01', :]
corona.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 590 entries, 0 to 589
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   일자      590 non-null    datetime64[ns]
 1   계(명)    590 non-null    int64         
dtypes: datetime64[ns](1), int64(1)
memory usage: 13.8 KB


In [71]:
corona.tail()

,일자,계(명)
585,2021-08-27,1837
586,2021-08-28,1791
587,2021-08-29,1619
588,2021-08-30,1485
589,2021-08-31,1370
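
The row counts above (618 before the filter, 590 after) can be cross-checked on a synthetic calendar. The sketch assumes the sheet has exactly one row per day from 2020-01-20 to 2021-09-28, which is consistent with the 618 rows reported earlier but is not verified here.

```python
import pandas as pd

# Synthetic stand-in: one row per calendar day (assumed layout of the sheet).
days = pd.DataFrame({"일자": pd.date_range("2020-01-20", "2021-09-28", freq="D")})

# Strict "<" against the first excluded day keeps everything through 2021-08-31.
cutoff = pd.Timestamp("2021-09-01")
kept = days.loc[days["일자"] < cutoff]

# Equivalent inclusive form using Series.between.
kept_between = days.loc[
    days["일자"].between(pd.Timestamp("2020-01-20"), pd.Timestamp("2021-08-31"))
]

print(len(days), len(kept), len(kept_between))
```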


### Build the weekend-free data

In [72]:
corona["일자"][6].strftime("%A")

'Sunday'

In [73]:
corona["일자"].dt.strftime("%A")

0         Monday
1        Tuesday
2      Wednesday
3       Thursday
4         Friday
         ...    
585       Friday
586     Saturday
587       Sunday
588       Monday
589      Tuesday
Name: 일자, Length: 590, dtype: object

In [74]:
corona["일자"].dt.weekday # 0 : Monday, 6 : Sunday

0      0
1      1
2      2
3      3
4      4
      ..
585    4
586    5
587    6
588    0
589    1
Name: 일자, Length: 590, dtype: int64
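
As a side note, `Series.dt.day_name()` is an alternative to `strftime("%A")` for weekday names, and it lines up with the `dt.weekday` numbering shown above. The sketch uses a synthetic week starting on the first date in the data.

```python
import pandas as pd

# One synthetic week starting on 2020-01-20, the first date in the data.
week = pd.Series(pd.date_range("2020-01-20", periods=7, freq="D"))

names = week.dt.day_name()   # English weekday names by default
numbers = week.dt.weekday    # 0 = Monday ... 6 = Sunday

print(pd.DataFrame({"date": week, "weekday": numbers, "name": names}))
```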

In [75]:
# weekday 5 (Saturday): 84 rows, weekday 6 (Sunday): 84 rows
# 590 - 84 - 84 = 422 rows remain
corona_kospi = corona.loc[(corona["일자"].dt.weekday != 5) & (corona["일자"].dt.weekday != 6), :]
corona_kospi.shape

(422, 2)
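
The arithmetic in the comment (84 Saturdays, 84 Sundays, 422 remaining rows) can be reproduced on a synthetic calendar covering the same period. The sketch also uses `isin([5, 6])`, which is an alternative spelling of the two `!=` comparisons and not the notebook's own code.

```python
import pandas as pd

# Synthetic stand-in for the date column: every day in the study period.
dates = pd.Series(pd.date_range("2020-01-20", "2021-08-31", freq="D"))

is_weekend = dates.dt.weekday.isin([5, 6])  # 5 = Saturday, 6 = Sunday
weekdays_only = dates[~is_weekend].reset_index(drop=True)

# Rows per weekday number, then the number of rows left after dropping weekends.
print(dates.dt.weekday.value_counts().sort_index())
print(len(weekdays_only))
```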

In [76]:
corona_kospi.reset_index(drop = True, inplace = True)
corona_kospi.tail()

,일자,계(명)
417,2021-08-25,2154
418,2021-08-26,1882
419,2021-08-27,1837
420,2021-08-30,1485
421,2021-08-31,1370


### Final DataFrames and CSV export

In [77]:
print("corona shape :", corona.shape)
print("corona_kospi shape : ", corona_kospi.shape)

corona shape : (590, 2)
corona_kospi shape :  (422, 2)


In [78]:
corona.to_csv("./CSV/01_corona.csv", index = False)
corona_kospi.to_csv("./CSV/02_corona_without_weekend.csv", index = False)
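
CSV stores dates as plain text, so the datetime dtype is lost on the way out. When the files are loaded again for the correlation analysis, passing `parse_dates` to `pd.read_csv` is one way to restore it. The sketch round-trips two rows from the tail shown earlier through an in-memory buffer instead of the real files.

```python
import io

import pandas as pd

# Two rows taken from the tail of the data shown earlier.
frame = pd.DataFrame(
    {
        "일자": pd.to_datetime(["2021-08-30", "2021-08-31"]),
        "계(명)": [1485, 1370],
    }
)

# Write to an in-memory buffer instead of ./CSV/... so the example has no side effects.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates converts the text column back to datetimes on load.
restored = pd.read_csv(buffer, parse_dates=["일자"])

print(restored.dtypes)
```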