# RQ : FakeID Detection in Instagram
- ### Data Sampling : SMOTE (Oversampling)
- ### Data Partitioning : train, validation, test

*Data sampling was not used in vivino_preprocessing.html (ipynb), so it is applied here to a new dataset.

## Google Drive Mounting

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


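## 0. Data Description
- Checking the dtypes of this dataset shows that every column is a float or int type, so the data is suitable for preprocessing as is.
- In line with the goal of FakeID detection, the 'is_fake' column is extracted and set as a separate target variable "y".
- The dataset is saved as fake_id.csv.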

https://www.kaggle.com/datasets/rezaunderfit/instagram-fake-and-real-accounts-dataset

In [None]:
import pandas as pd 

train_file = '/content/drive/Shareddrives/22-1 데이터마이닝/fake_id.csv'
df = pd.read_csv(train_file)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 785 entries, 0 to 784
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   edge_followed_by      785 non-null    float64
 1   edge_follow           785 non-null    float64
 2   username_length       785 non-null    int64  
 3   username_has_number   785 non-null    int64  
 4   full_name_has_number  785 non-null    int64  
 5   full_name_length      785 non-null    int64  
 6   is_private            785 non-null    int64  
 7   is_joined_recently    785 non-null    int64  
 8   has_channel           785 non-null    int64  
 9   is_business_account   785 non-null    int64  
 10  has_guides            785 non-null    int64  
 11  has_external_url      785 non-null    int64  
 12  is_fake               785 non-null    int64  
dtypes: float64(2), int64(11)
memory usage: 79.9 KB


In [None]:
df.head()

Unnamed: 0,edge_followed_by,edge_follow,username_length,username_has_number,full_name_has_number,full_name_length,is_private,is_joined_recently,has_channel,is_business_account,has_guides,has_external_url,is_fake
0,0.001,0.257,13,1,1,13,0,0,0,0,0,0,1
1,0.0,0.958,9,1,0,0,0,1,0,0,0,0,1
2,0.0,0.253,12,0,0,0,0,0,0,0,0,0,1
3,0.0,0.977,10,1,0,0,0,0,0,0,0,0,1
4,0.0,0.321,11,0,0,11,1,0,0,0,0,0,1


In [None]:
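# pop() removes 'is_fake' from df in place and returns it as the target
y = df.pop('is_fake')
# Keep the target as a one-column DataFrame
y = pd.DataFrame(y)
y.describe()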

Unnamed: 0,is_fake
count,785.0
mean,0.881529
std,0.323371
min,0.0
25%,1.0
50%,1.0
75%,1.0
max,1.0


In [None]:
x = df
x.describe()

Unnamed: 0,edge_followed_by,edge_follow,username_length,username_has_number,full_name_has_number,full_name_length,is_private,is_joined_recently,has_channel,is_business_account,has_guides,has_external_url
count,785.0,785.0,785.0,785.0,785.0,785.0,785.0,785.0,785.0,785.0,785.0,785.0
mean,0.002223,0.401606,11.630573,0.644586,0.109554,6.129936,0.184713,0.361783,0.0,0.073885,0.001274,0.06242
std,0.036105,0.293845,3.284329,0.478944,0.312532,6.943903,0.388312,0.480823,0.0,0.261751,0.035692,0.242072
min,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.135,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.336,11.0,1.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.623,13.0,1.0,0.0,11.0,0.0,1.0,0.0,0.0,0.0,0.0
max,1.0,1.0,26.0,1.0,1.0,30.0,1.0,1.0,0.0,1.0,1.0,1.0


## 1. Data Sampling
- The target variable y is heavily imbalanced between 0 and 1, so data sampling is required.
- Because the dataset is small (785 rows), oversampling was chosen over undersampling, which discards information.

In [None]:
y['is_fake'].value_counts()

1    692
0     93
Name: is_fake, dtype: int64
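The counts above can be turned into an imbalance ratio directly. The following is a minimal sketch in plain Python that uses only the two counts printed above; the variable names are illustrative and do not appear in the notebook.

```python
# Class counts copied from the value_counts() output above
n_fake = 692   # is_fake == 1
n_real = 93    # is_fake == 0
n_total = n_fake + n_real

# Share of fake accounts; this equals the mean of is_fake reported by describe()
fake_share = n_fake / n_total
# Majority-to-minority ratio
imbalance_ratio = n_fake / n_real

print(f"total rows      : {n_total}")
print(f"fake share      : {fake_share:.6f}")
print(f"imbalance ratio : {imbalance_ratio:.2f} : 1")
```

With roughly seven fake accounts for every real one, a classifier that always predicts `is_fake = 1` would already be correct on about 88% of the rows, which is why the imbalance has to be handled before modeling.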

### Oversampling: SMOTE
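Oversampling can be done with techniques such as the following.
- Resampling : duplicates existing data; carries a risk of overfitting
- SMOTE : picks a minority-class point at random, finds its K nearest minority-class neighbors, randomly selects one of them, and generates a new point between the two minority-class points
- Borderline-SMOTE : starts from the hypothesis that "oversampling concentrated near the class boundary will improve performance" and amplifies the minority-class points close to the boundary
- ADASYN : varies the number of samples generated in each boundary region according to that region's characteristics; it is reported to outperform SMOTE and Borderline-SMOTE mainly when there are two or more groups of normal-class and abnormal-class data

Since the goal here is simply FakeID detection, the standard SMOTE method was adopted.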
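The SMOTE step described in the list above can be sketched in a few lines. This is an illustrative plain-Python version, not the imblearn implementation: `smote_sample` is a hypothetical helper, the 2-D points are made up, and Euclidean distance is assumed.

```python
import math
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point by SMOTE-style interpolation (illustrative helper)."""
    if rng is None:
        rng = random.Random()
    # 1. Pick a random minority-class point
    base = rng.choice(minority)
    # 2. Find its k nearest minority-class neighbours, excluding the point itself
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    # 3. Pick one of those neighbours at random
    neighbour = rng.choice(neighbours)
    # 4. Interpolate at a random position on the segment between the two points
    gap = rng.random()
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbour))

# Toy 2-D minority points (made up for illustration)
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
rng = random.Random(312)
synthetic = [smote_sample(minority_points, k=3, rng=rng) for _ in range(4)]
for point in synthetic:
    print(point)
```

Because each synthetic point lies on a segment between two existing points, 0/1 columns such as `is_private` can take fractional values in the generated rows.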

In [None]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(random_state=312)
X_ad, Y_ad = oversample.fit_resample(x, y)
print(Y_ad.value_counts())

is_fake
0          692
1          692
dtype: int64


In [None]:
X_ad.shape

(1384, 12)

In [None]:
Y_ad.shape

(1384, 1)

In [None]:
Y_ad['is_fake'].value_counts()

1    692
0    692
Name: is_fake, dtype: int64

## 2. Data Partitioning

In [None]:
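# train:val:test = 6:2:2
from sklearn.model_selection import train_test_split

# First hold out 20% of the resampled data as the test set
X_train, X_test, y_train, y_test = train_test_split(X_ad, Y_ad, test_size=0.2, random_state=1)
# Then take 25% of the remaining 80% (= 20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)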
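The two-step split reaches 6:2:2 because 25% of the remaining 80% is 20% of the total. The arithmetic can be checked with a small sketch; `split_sizes` is a hypothetical helper, and it assumes the held-out count is rounded up, which is consistent with the shapes printed in this notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size (hypothetical helper).

    Assumes the held-out count is rounded up.
    """
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

n_total = 1384  # rows after SMOTE
n_rest, n_test = split_sizes(n_total, 0.2)   # first split: hold out the test set
n_train, n_val = split_sizes(n_rest, 0.25)   # second split: hold out the validation set

print(n_train, n_val, n_test)
print([round(n / n_total, 3) for n in (n_train, n_val, n_test)])
```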

In [None]:
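# Print the shape of every partition, including the validation set
parts = {"X_train": X_train, "X_val": X_val, "X_test": X_test,
         "y_train": y_train, "y_val": y_val, "y_test": y_test}
for name, part in parts.items():
    print(f"{name} : {part.shape}")

X_train : (830, 12)
X_val : (277, 12)
X_test : (277, 12)
y_train : (830, 1)
y_val : (277, 1)
y_test : (277, 1)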
    
