<Part2 - Data Analysis (Classification)>
- Import required packages
- Load the data and run a brief EDA
- Preprocess the data
- Split into train/test -> run the analysis
- Evaluate performance

## DecisionTree Classifier (decision tree classification model)


### 1. Import required packages

In [54]:
# package import
import numpy as np
import pandas as pd
import sklearn

In [55]:
# decisionTree - classifier
from sklearn.tree import DecisionTreeClassifier

# data split
from sklearn.model_selection import train_test_split

### 2. Load data + EDA
- iris data

In [56]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/iris.csv')

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [6]:
df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### 3. Data preprocessing
- Handling outliers, imputing missing values, etc.
- Data encoding, unit conversion, normalization, derived-variable creation, etc.

In [7]:
df['species'].unique()

array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [8]:
# label-encode the nominal variable
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['species_le'] = le.fit_transform(df['species'])

# the same mapping can also be done with pandas replace (assign the result back)
# df['species'] = df['species'].replace({'setosa': 0, 'versicolor': 1, 'virginica': 2})
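To make the encoding step less of a black box, here is a minimal pure-Python sketch of the mapping that label encoding produces. The helper name `label_encode` and the toy label list are assumptions for illustration; the only behavior taken from scikit-learn is that `LabelEncoder` assigns integers to the unique classes in sorted order.

```python
def label_encode(values):
    """Hypothetical helper that mirrors LabelEncoder's mapping for string labels."""
    classes = sorted(set(values))                      # unique classes, sorted
    mapping = {c: i for i, c in enumerate(classes)}    # class -> integer code
    return [mapping[v] for v in values], classes


# toy labels, not the real iris column
labels = ['setosa', 'virginica', 'versicolor', 'setosa']
codes, classes = label_encode(labels)
print(codes)    # [0, 2, 1, 0]
print(classes)  # ['setosa', 'versicolor', 'virginica']
```

The integer codes carry no real ordering between species; for a tree model's target this is harmless, since the codes are only class identifiers.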

### 4. Dataset preparation
- Split into train/test
- Fit the model

In [9]:
# X holds the independent variables, y the dependent variable
X = df[['sepal_length','sepal_width','petal_length','petal_width']]
y = df['species_le'] ## use the label-encoded values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state =11)

In [10]:
# check the sizes of the split datasets
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(120, 4)
(30, 4)
(120,)
(30,)
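The 120/30 shapes above follow directly from `test_size = 0.2` on 150 rows. As a rough sketch of the idea behind a seeded split (not scikit-learn's actual implementation, and the selected rows will differ from `train_test_split`), the hypothetical helper `split_indices` below shuffles row indices with a fixed seed and slices off the test share.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=11):
    """Hypothetical helper: shuffle row indices reproducibly, then cut off the test share."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)          # same seed -> same shuffle
    n_test = round(n_rows * test_size)        # simplification of the size rule
    return idx[n_test:], idx[:n_test]         # (train indices, test indices)


train_idx, test_idx = split_indices(150)
print(len(train_idx), len(test_idx))  # 120 30
```

Fixing the seed is what makes the split, and therefore the reported scores, repeatable from run to run.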


In [11]:
# fit the model
dt = DecisionTreeClassifier(random_state=11)
dt.fit(X_train, y_train) ## run the analysis
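`DecisionTreeClassifier` uses Gini impurity as its default split criterion (`criterion='gini'`). The sketch below shows how that quantity is computed for a node and for a candidate split. The helper names and the toy split (one class separated cleanly from the other two) are assumptions for illustration, not the splits the fitted tree actually chose.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted impurity of the two child nodes of a candidate split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# toy node with three equally frequent classes
parent = [0] * 50 + [1] * 50 + [2] * 50
left, right = [0] * 50, [1] * 50 + [2] * 50

print(round(gini(parent), 3))             # 0.667
print(round(split_gini(left, right), 3))  # 0.333
```

A tree builder evaluates many candidate thresholds this way and keeps the split with the lowest weighted impurity.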

In [12]:
# run classification (prediction) with the fitted dt
pred = dt.predict(X_test)
print(pred)

[2 2 1 1 2 0 1 0 0 1 1 1 1 2 2 0 2 1 2 2 1 0 0 1 0 0 2 1 0 1]


### 5. Performance evaluation
- Accuracy
- Confusion matrix
- Report: evaluation metrics derived from the confusion matrix



In [13]:
# compare the predicted classes with the actual classes
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print(acc)

0.9333333333333333


In [14]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)
# 28 of the 30 predictions match the actual class (the diagonal: 9 + 10 + 9)

array([[ 9,  0,  0],
       [ 0, 10,  0],
       [ 0,  2,  9]])

In [15]:
# table of evaluation metrics
from sklearn.metrics import classification_report
rpt = classification_report(y_test, pred)
print(rpt)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       0.83      1.00      0.91        10
           2       1.00      0.82      0.90        11

    accuracy                           0.93        30
   macro avg       0.94      0.94      0.94        30
weighted avg       0.94      0.93      0.93        30



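✔ Precision

Of the samples predicted as a class, the fraction that truly belong to it
→ how accurate the predictions are  

✔ Recall

Of the samples that truly belong to a class, the fraction the model predicted correctly
→ how well the model finds them without missing any

✔ F1-score (harmonic mean)

A score that accounts for both precision and recall
→ an important metric for imbalanced data  
It is the harmonic mean of precision and recall, so both must be high for the F1-score to be high

✔ Support

The number of samples of that class in the test set
→ a simple count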
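As a cross-check of these definitions, the sketch below recomputes the per-class numbers from the iris confusion matrix printed earlier (rows are actual classes, columns are predicted classes). The helper name `per_class_metrics` is an assumption for illustration; the matrix values are copied from the output above.

```python
# confusion matrix from the decision tree on iris: rows = actual, columns = predicted
cm = [[9, 0, 0],
      [0, 10, 0],
      [0, 2, 9]]


def per_class_metrics(cm, k):
    """Hypothetical helper: precision, recall, F1 and support for class k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum: everything predicted as k
    actual_k = sum(cm[k])                     # row sum: everything that really is k
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1, actual_k


for k in range(len(cm)):
    p, r, f, s = per_class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2), s)

# accuracy = correct predictions (the diagonal) / all predictions
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print(accuracy)
```

Class 1 has precision 10 / 12 because two class-2 flowers were predicted as class 1, and those same two misses lower the recall of class 2 to 9 / 11.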

## Practice 1 - Titanic dataset

### 1. Import required packages

In [16]:
import numpy as np
import pandas as pd
import sklearn

# DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# train_test split
from sklearn.model_selection import train_test_split

### 2. Load data + EDA

In [17]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/titanic.csv')

In [18]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [19]:
df.info() ## check missing values, dtypes, etc.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [20]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


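### 3. Data preprocessing
- Age -> impute with the mean
- Cabin -> too many missing values, so it is excluded from the analysis
- Embarked -> impute with the mode
- Sex, Embarked: encoded with LabelEncoder -> this introduces an artificial ordering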
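The two imputation rules in this list can be sketched without pandas as follows. The toy lists are made-up stand-ins for the `Age` and `Embarked` columns, with `None` marking a missing value.

```python
from statistics import mean, mode

# toy columns (assumed values, not the real Titanic data); None = missing
ages = [22.0, 38.0, None, 35.0]
embarked = ['S', 'C', None, 'S']

# mean imputation: average the observed values, then fill the gaps with it
age_mean = mean([a for a in ages if a is not None])
ages_filled = [age_mean if a is None else a for a in ages]

# mode imputation: most frequent observed category fills the gaps
embarked_mode = mode([e for e in embarked if e is not None])
embarked_filled = [embarked_mode if e is None else e for e in embarked]

print(ages_filled)
print(embarked_filled)  # ['S', 'C', 'S', 'S']
```

Both statistics are computed from the observed values only, which is also what `Series.mean()` and `Series.mode()` do by default in pandas.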

In [21]:

# impute Age with the mean
age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)  # assign back instead of chained inplace=True
print(df['Age'].isna().sum())

0


In [22]:
# the Cabin column is excluded
# impute Embarked with the mode
embarked_mode = df['Embarked'].mode()[0]
print(embarked_mode)
df['Embarked'] = df['Embarked'].fillna(embarked_mode)  # assign back instead of chained inplace=True
print(df['Embarked'].isna().sum())

S
0


In [23]:
# encode Sex and Embarked
df['Sex'].unique()

from sklearn.preprocessing import LabelEncoder
df['Sex'] =  LabelEncoder().fit_transform(df['Sex'])
df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])

In [24]:
# create a derived variable
# add SibSp (number of siblings or spouses) and Parch (number of parents or children) to build the FamilySize column
df['FamilySize'] = df['SibSp']+df['Parch']
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,2,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,0,1
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,2,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,2,1
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,2,0


### 4. Dataset preparation and analysis

In [25]:
# data split
X = df[['Pclass','Sex','Age','Fare', 'Embarked', 'FamilySize']]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=11)

In [26]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(712, 6)
(179, 6)
(712,)
(179,)


In [27]:
dt = DecisionTreeClassifier(random_state=11)
dt.fit(X_train, y_train) ## train the model

In [28]:
# predict on the test set
pred = dt.predict(X_test)
print(pred)

[1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 0 0 0 1 1 1 1 1 0 0 0
 0 1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 1
 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 0 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0
 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0
 1 0 0 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 1 0]


### 5. Performance evaluation
- 0: died, 1: survived

In [29]:
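# check model performance with the confusion matrix
from sklearn.metrics import confusion_matrix

mat = confusion_matrix(y_test, pred)
print(mat)
## rows = actual class, columns = predicted class
## precision for class 0 (died) = 98 / (98 + 18): of the passengers predicted as 0, the fraction that are actually 0
## recall for class 0 (died) = 98 / (98 + 20): of the passengers that are actually 0, the fraction predicted as 0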

[[98 20]
 [18 43]]


In [30]:
# check the report
from sklearn.metrics import classification_report
rpt = classification_report(y_test, pred)
print(rpt)

              precision    recall  f1-score   support

           0       0.84      0.83      0.84       118
           1       0.68      0.70      0.69        61

    accuracy                           0.79       179
   macro avg       0.76      0.77      0.77       179
weighted avg       0.79      0.79      0.79       179



In [31]:
# check the accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

0.7877094972067039


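## KNN - iris
- Supervised learning
- Classifies a sample by referring to the K training samples closest to it
- Because the method is distance-based, scaling the features beforehand is essential!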
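The idea in these bullets fits in a few lines of plain Python. This is a minimal sketch assuming Euclidean distance and a simple majority vote; the helper name `knn_predict` and the toy points are hypothetical, and scikit-learn's `KNeighborsClassifier` uses faster neighbor-search structures internally.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Hypothetical helper: label a query point by majority vote of its k nearest neighbors."""
    # distance from the query to every training point, nearest first
    dists = sorted((math.dist(x, query), label) for x, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# toy 2-feature points, already scaled to [0, 1] (assumed values)
train_X = [(0.1, 0.1), (0.2, 0.1), (0.8, 0.9), (0.9, 0.8), (0.85, 0.95)]
train_y = ['setosa', 'setosa', 'virginica', 'virginica', 'virginica']

print(knn_predict(train_X, train_y, (0.15, 0.2)))  # setosa
print(knn_predict(train_X, train_y, (0.8, 0.8)))   # virginica
```

Because every feature enters the distance with equal weight, a feature measured on a larger scale would dominate the vote, which is why scaling comes first.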

### 1. Import required packages

In [32]:
# basic package imports omitted (including train_test_split)
# package for KNN classification
from sklearn.neighbors import KNeighborsClassifier

### 2. Load data + EDA
- EDA omitted

In [33]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/iris.csv')

### 3. Data preprocessing
- Feature scaling is essential for KNN
- Min-Max normalization

In [34]:
from sklearn.preprocessing import MinMaxScaler

# scale the four features to [0, 1]; MinMaxScaler handles each column independently
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
scaler = MinMaxScaler()
df[cols] = scaler.fit_transform(df[cols])
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,0.222222,0.625,0.067797,0.041667,setosa
1,0.166667,0.416667,0.067797,0.041667,setosa
2,0.111111,0.5,0.050847,0.041667,setosa
3,0.083333,0.458333,0.084746,0.041667,setosa
4,0.194444,0.666667,0.067797,0.041667,setosa


### 4. Dataset preparation and model fitting

In [35]:
# split the dataset
X = df[['sepal_length','sepal_width', 'petal_length', 'petal_width']]
y = df['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11, test_size=0.2)

In [36]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(120, 4)
(30, 4)
(120,)
(30,)


In [37]:
# create the KNeighborsClassifier object
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)

In [38]:
pred = knn.predict(X_test)
print(pred)

['virginica' 'virginica' 'versicolor' 'versicolor' 'virginica' 'setosa'
 'versicolor' 'setosa' 'setosa' 'versicolor' 'versicolor' 'versicolor'
 'versicolor' 'virginica' 'virginica' 'setosa' 'virginica' 'versicolor'
 'virginica' 'virginica' 'versicolor' 'setosa' 'setosa' 'versicolor'
 'setosa' 'setosa' 'virginica' 'versicolor' 'setosa' 'versicolor']


### 5. Performance evaluation

In [39]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print(acc)

0.9333333333333333


In [40]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, pred))

[[ 9  0  0]
 [ 0 10  0]
 [ 0  2  9]]


In [41]:
from sklearn.metrics import classification_report
rpt = classification_report(y_test, pred)
print(rpt)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00         9
  versicolor       0.83      1.00      0.91        10
   virginica       1.00      0.82      0.90        11

    accuracy                           0.93        30
   macro avg       0.94      0.94      0.94        30
weighted avg       0.94      0.93      0.93        30



## SVM - titanic

### 1. Import required packages

In [42]:
# basic package imports omitted
# import only the svm package
from sklearn import svm

### 2. Load data + EDA
- EDA omitted

In [43]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/titanic.csv')

### 3. Data preprocessing
- One-hot encode the Sex and Embarked columns

In [44]:
# impute Age with the mean
age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)  # assign back instead of chained inplace=True

# impute Embarked with the mode
embarked_mode = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(embarked_mode)

# create a derived variable
df['FamilySize'] = df['SibSp'] + df['Parch']


In [45]:
# one-hot encoding
oh_sex = pd.get_dummies(df['Sex']).astype(int)
df = pd.concat([df, oh_sex], axis=1)
oh_embarked = pd.get_dummies(df['Embarked']).astype(int)
df = pd.concat([df, oh_embarked], axis=1)
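For readers new to one-hot encoding, the sketch below reproduces the idea behind `pd.get_dummies` in plain Python: one 0/1 indicator per category, with exactly one indicator set per row. The helper name `one_hot` and the toy values are assumptions for illustration.

```python
def one_hot(values):
    """Hypothetical helper: one 0/1 indicator per category, categories in sorted order."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


# toy Embarked values (assumed)
rows = one_hot(['S', 'C', 'S', 'Q'])
print(rows[0])  # {'C': 0, 'Q': 0, 'S': 1}
print(rows[1])  # {'C': 1, 'Q': 0, 'S': 0}
```

Unlike label encoding, this representation does not imply any order between the categories, at the cost of one extra column per category.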

In [46]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize,female,male,C,Q,S
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,0,1,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,1,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,1,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,1,0,0,0,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,0,1,0,0,1


### 4. Dataset preparation and analysis

In [47]:
X = df[['Pclass', 'Age', 'Fare', 'FamilySize', 'female', 'male','C','Q','S']]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10)

In [48]:
# fit the model with svm.SVC() using the rbf kernel
# the estimator is named svc so that it does not overwrite the imported svm module
svc = svm.SVC(kernel='rbf')
svc.fit(X_train, y_train)

In [49]:
pred = svc.predict(X_test)
print(pred)

[0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0
 0 0 0 0 1 1 0 0 0]


### 5. Performance evaluation

In [50]:
from sklearn.metrics import classification_report
rpt = classification_report(y_test, pred)
print(rpt)

              precision    recall  f1-score   support

           0       0.71      0.96      0.82       174
           1       0.79      0.29      0.42        94

    accuracy                           0.72       268
   macro avg       0.75      0.62      0.62       268
weighted avg       0.74      0.72      0.68       268



In [51]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, pred))

0.7238805970149254


In [52]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)

array([[167,   7],
       [ 67,  27]])

### Tuning the SVM kernel
- rbf
- linear, C=0.1, gamma=0.1 (gamma has no effect with the linear kernel)
- rbf, C=0.1, gamma=0.1

In [53]:
from sklearn import svm
sv1 = svm.SVC(kernel='linear', C=0.1, gamma = 0.1)
sv1.fit(X_train, y_train)

# predict
pred = sv1.predict(X_test)
print(pred)

# evaluate performance
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,pred)) ## higher accuracy than the rbf model above (0.806 vs. 0.724)

[0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 1 0 1
 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 1 1 0 1 0 0 0 0 0 1 1 0 0 1 0 1
 0 1 0 0 0 0 1 1 0 1 0 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0
 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0
 1 1 1 1 0 0 1 1 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1
 0 0 0 1 1 1 1 1 1]
0.8059701492537313
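This notebook does not isolate why the linear kernel scores higher than rbf here. One factor worth keeping in mind is that the RBF kernel, `exp(-gamma * ||x - y||^2)`, is very sensitive to feature scale. The sketch below illustrates this with two toy passengers described by (Pclass, Age, Fare); the raw values resemble the first two rows of the dataset, while the scaled values are rough made-up stand-ins for min-max scaled features, so treat the numbers as illustrative only.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# (Pclass, Age, Fare) on the raw scale
raw_a, raw_b = (3, 22.0, 7.25), (1, 38.0, 71.2833)
# the same passengers after a hypothetical min-max scaling (assumed values)
scaled_a, scaled_b = (1.0, 0.27, 0.014), (0.0, 0.47, 0.139)

sim_raw = rbf_kernel(raw_a, raw_b, gamma=0.1)
sim_scaled = rbf_kernel(scaled_a, scaled_b, gamma=0.1)

print(sim_raw)     # vanishingly small: the Fare difference dominates the distance
print(sim_scaled)  # a moderate similarity once features share a common scale
```

With unscaled inputs and a fixed `gamma`, almost every pair of passengers looks completely dissimilar, so trying a scaler before `SVC` is a reasonable experiment when comparing kernels.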


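## Logistic Regression - iris
- Uses the output of the sigmoid function as the probability of belonging to each class
- Assigns each sample to the more probable class; the basic form is a binary classifier, and scikit-learn's LogisticRegression also handles multiclass targets such as the three iris species


##### L2 - Ridge, the default
##### L1 - Lasso
Regularization is needed to prevent overfitting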
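As a quick illustration of the pieces named here, the sketch below computes the sigmoid and the two penalty terms in plain Python. L1 (Lasso) sums absolute weights and L2 (Ridge) sums squared weights; the weight vector is a made-up example, and the helper names are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def l1_penalty(weights):
    """Lasso penalty: sum of absolute weights (tends to push some weights to exactly 0)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Ridge penalty: sum of squared weights (shrinks all weights smoothly)."""
    return sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0]   # toy coefficients (assumed)

print(sigmoid(0.0))          # 0.5
print(l1_penalty(weights))   # 2.5
print(l2_penalty(weights))   # 4.25
```

In scikit-learn the strength of the penalty is controlled by `C`, where a smaller `C` means stronger regularization.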

### 1.필요 패키지 import

In [59]:
import pandas as pd
import numpy as np
import sklearn

# import the package for logistic regression classification
from sklearn.linear_model import LogisticRegression

# train / test split
from sklearn.model_selection import train_test_split

### 2. Load data + EDA

In [66]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/iris.csv')

### 3. Data preprocessing
- Min-Max scaler

In [67]:
from sklearn.preprocessing import MinMaxScaler

# scale the four features to [0, 1]; MinMaxScaler handles each column independently
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
scaler = MinMaxScaler()
df[cols] = scaler.fit_transform(df[cols])

### 4. Dataset preparation and analysis

In [69]:
X = df[['sepal_length','sepal_width','petal_length','petal_width']]
y = df['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 11)

In [70]:
print(X_train.shape)

(120, 4)


In [71]:
# run LogisticRegression
lr = LogisticRegression() ## create the logistic regression model object
lr.fit(X_train, y_train)

In [74]:
pred = lr.predict(X_test)

### 5. Performance evaluation and visualization

In [77]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)
print("accuracy", acc)

accuracy 0.8333333333333334


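## RandomForest - titanic
- For the Big Data Analysis Engineer exam (빅분기), the common advice is to reach for random forest without a second thought
- An ensemble technique that performs classification and regression through bagging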
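Bagging has two ingredients: each model is trained on a bootstrap sample (rows drawn with replacement), and the final class is the majority vote across models. The sketch below shows only those two mechanics in plain Python; the helper names and toy inputs are assumptions, and a real random forest additionally considers a random subset of features at each split.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; some rows repeat, others are left out."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine the per-tree predictions for one sample into a single class."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(20)            # fixed seed for a repeatable sample
rows = list(range(10))             # toy row indices
sample = bootstrap_sample(rows, rng)

print(len(sample))                      # 10
print(majority_vote([1, 0, 1, 1, 0]))   # 1
```

Averaging many trees trained on different resamples is what reduces the variance of a single deep decision tree.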


### 1. Import required packages

In [78]:
# import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

### 2. Load data + EDA
- EDA omitted

In [81]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/titanic.csv')

### 3. Data preprocessing

In [85]:
# run df.info, df.describe, df.head, etc.
# impute missing values and label-encode the Sex and Embarked variables

# impute Age with the mean
age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)  # assign back instead of chained inplace=True

# impute Embarked with the mode
embarked_mode = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(embarked_mode)

# create a derived variable
df['FamilySize'] = df['SibSp'] + df['Parch']

# encoding
from sklearn.preprocessing import LabelEncoder
df['Sex'] = LabelEncoder().fit_transform(df['Sex'])
df['Embarked_le'] = LabelEncoder().fit_transform(df['Embarked'])  # encoded copy; the original Embarked column is kept

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize,Embarked_le
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,S,1,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C,1,0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,S,0,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,S,1,2
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,S,0,2


### 4. Dataset preparation and analysis



In [88]:
X = df[['Pclass', 'Sex','Age', 'Fare', 'Embarked_le', 'FamilySize']]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

In [91]:
# randomforest
rf = RandomForestClassifier(n_estimators = 50, max_depth = 3, random_state=20)
rf.fit(X_train, y_train)

In [95]:
pred = rf.predict(X_test)
print(pred)

[0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0
 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
 0 0 1 0 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0
 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0]


### 5. Performance evaluation

In [99]:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, pred)  # conventional argument order: (y_true, y_pred)
print("accuracy_score: ", acc)

accuracy_score:  0.8603351955307262


## Practice - iris

In [100]:
# import required packages
import pandas as pd
import numpy as np
import sklearn
# model fitting
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# performance evaluation
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

### 2. Data EDA

In [101]:
df = pd.read_csv('https://raw.githubusercontent.com/YoungjinBD/dataset/main/iris.csv')

In [102]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [103]:
df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### 3. Data preprocessing

In [105]:
#LabelEncoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


### 4. Dataset preparation and analysis

In [107]:
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = df['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=10)

In [109]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [112]:
pred = rf.predict(X_test)
print(pred)

[1 2 0 1 0 1 1 1 0 1 1 2 1 0 0 2 1 0 0 0 2 2 2 0 1 0 1 1 1 2]


### 5. Model performance evaluation

In [113]:
acc = accuracy_score(y_test, pred)
print("accuracy: ", acc)
print(classification_report(y_test, pred))

accuracy:  1.0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00         7

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30


