# 1. Logistic Regression using IMDB review comments
 - DATA: https://www.kaggle.com/competitions/word2vec-nlp-tutorial/overview

##### 0) Import libraries

(1) Basic libraries

In [1]:
import pandas as pd

(2) Deep learning libraries

In [2]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

##### 1) Load the data

In [3]:
df = pd.read_csv("./data/labeledTrainData.tsv",sep='\t')

In [4]:
df

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
...,...,...,...
24995,3453_3,0,It seems like more consideration has gone into...
24996,5064_1,0,I don't believe they made this film. Completel...
24997,10905_3,0,"Guy is a loser. Can't get girls, needs to buil..."
24998,10194_3,0,This 30 minute documentary Buñuel made in the ...


In [5]:
df = df[:200]  # keep only the first 200 reviews

In [6]:
df.head(2)

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."


##### 2) Data preprocessing

(1) Extract the number of sentences

In [7]:
# Approximate sentence count: number of "."-separated fragments in each review
df["sentence_len"] = df["review"].apply(lambda x: len(x.split(".")))
# df["review"].apply(lambda x: len(x.split("."))).value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["sentence_len"] = df["review"].apply(lambda x: len(x.split(".")))
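
The warning above appears because `df` was created by slicing another DataFrame (`df[:200]`), so pandas cannot tell whether the new column is also meant to affect the original. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` of the slice is acceptable. The toy DataFrame and the names `toy` and `subset` are illustrative and not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the review table (illustrative data, not the IMDB file).
toy = pd.DataFrame({"review": ["Good. Fun.", "Bad.", "Okay. Fine. Done."]})

# .copy() makes the slice an independent DataFrame, so adding a column
# to it is unambiguous and does not touch `toy`.
subset = toy[:2].copy()
subset["sentence_len"] = subset["review"].apply(lambda x: len(x.split(".")))

print(subset)
```

Note that `len(x.split("."))` counts period-separated fragments, so a review that ends with a period is counted one higher than its number of sentences.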


(2) Extract the corpus

In [8]:
corpus = df['review']

(3) Extract TF-IDF features

In [9]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

In [10]:
vectorizer

TfidfVectorizer()
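
To make the weights in `X` less opaque, here is a pure-Python sketch of the computation that `TfidfVectorizer` performs with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The whitespace tokenization is a simplification (scikit-learn uses a regular-expression tokenizer), and `tfidf_rows` and `toy_docs` are illustrative names, not part of the notebook:

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    # Simplification: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in raw])
    return vocab, rows


toy_docs = ["good movie", "bad movie", "good good plot"]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(v, 3) for v in row])
```

In the second toy document, the rarer word `bad` ends up with a larger weight than `movie`, which occurs in two of the three documents. Down-weighting common words is the purpose of the IDF factor.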

In [11]:
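# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() (available since 1.0) replaces it and returns an array
vectorizer.get_feature_names_out().tolist()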



['000',
 '10',
 '100',
 '101',
 '11',
 '117',
 '12',
 '13th',
 '14',
 '145',
 '147',
 '15',
 '16',
 '1600',
 '166',
 '17',
 '18',
 '1800',
 '1880s',
 '1930s',
 '1933',
 '1938',
 '1939',
 '1950',
 '1950s',
 '1951',
 '1957',
 '1958',
 '1959',
 '1960',
 '1960s',
 '1961',
 '1964',
 '1965',
 '1968',
 '1970',
 '1971',
 '1973',
 '1974',
 '1975',
 '1976',
 '1978',
 '1980',
 '1980s',
 '1982',
 '1983',
 '1983s',
 '1984',
 '1985',
 '1986',
 '1988',
 '1991',
 '1992',
 '1996',
 '1998',
 '1999',
 '19th',
 '20',
 '2000',
 '2002',
 '2004',
 '2006',
 '2007',
 '2022',
 '2038',
 '21st',
 '29',
 '2d',
 '30',
 '300',
 '3000',
 '30s',
 '34',
 '35',
 '3d',
 '40',
 '40s',
 '45',
 '4th',
 '50',
 '50s',
 '60',
 '600',
 '64',
 '6th',
 '70',
 '70s',
 '71',
 '747',
 '75',
 '79',
 '80',
 '80s',
 '87',
 '88',
 '89',
 '90',
 '90s',
 '93',
 '95',
 '99',
 '_have_',
 'aag',
 'abandon',
 'abandoned',
 'abc',
 'abilities',
 'ability',
 'able',
 'abominable',
 'abomination',
 'abortion',
 'abound',
 'about',
 'above',
 ...

In [12]:
type(X)

scipy.sparse._csr.csr_matrix

In [13]:
print(X.shape)

(200, 7221)


In [14]:
X

<200x7221 sparse matrix of type '<class 'numpy.float64'>'
	with 26716 stored elements in Compressed Sparse Row format>
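
The repr above reports 26716 stored values in a 200 x 7221 matrix. A quick back-of-the-envelope calculation, using only those three numbers, shows how sparse the TF-IDF matrix is and how large its dense `float64` form would be (the variable names are illustrative):

```python
n_rows, n_cols, n_stored = 200, 7221, 26716  # values from the repr above

density = n_stored / (n_rows * n_cols)
avg_nonzero_per_review = n_stored / n_rows
dense_bytes = n_rows * n_cols * 8  # float64 takes 8 bytes per entry

print(f"density: {density:.2%}")
print(f"average non-zero terms per review: {avg_nonzero_per_review:.1f}")
print(f"dense float64 size: {dense_bytes / 1e6:.1f} MB")
```

Fewer than 2% of the entries are non-zero, which is why the dense printouts in the following cells consist almost entirely of `0.0`.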

In [15]:
X.todense()

matrix([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.05429101, 0.02896677, 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.05084889, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

In [16]:
X.todense()[0].tolist()[0]

[0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.03884727267012371,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.03391634339884525,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 ...

In [17]:
df

Unnamed: 0,id,sentiment,review,sentence_len
0,5814_8,1,With all this stuff going down at the moment w...,21
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",17
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,21
3,3630_4,0,It must be assumed that those who praised this...,9
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,10
...,...,...,...,...
195,8807_9,1,This is a collection of documentaries that las...,11
196,12148_10,1,"This movie has a lot of comedy, not dark and G...",4
197,10771_2,0,"Have not watched kids films for some years, so...",15
198,6766_3,0,You probably heard this phrase when it come to...,11


(4) Extract the content vectors

In [18]:
df["review_vector"] = X.todense().tolist()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["review_vector"] = X.todense().tolist()


In [19]:
df['review_vector']

0      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
1      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2      [0.054291008404015635, 0.028966765548706894, 0...
3      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
                             ...                        
195    [0.06318363232414913, 0.03371139195701426, 0.0...
196    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
197    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
198    [0.0, 0.05084888721568689, 0.0, 0.0, 0.0, 0.0,...
199    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
Name: review_vector, Length: 200, dtype: object

In [20]:
df['review_vector'].values.tolist()

[[0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.03884727267012371,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.03391634339884525,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  ...

In [21]:
torch.tensor(df["review_vector"].values.tolist()).to(torch.float32)

tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0543, 0.0290, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0508, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]])
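
The cells above route the TF-IDF matrix through nested Python lists stored in a DataFrame column before building the tensor. A possible shortcut, sketched here under the assumption that the dense array fits in memory, is to convert the sparse matrix directly. The small matrix `X_toy` is an illustrative stand-in for `X`:

```python
import numpy as np
import scipy.sparse as sp
import torch

# Illustrative 3x4 sparse matrix standing in for the TF-IDF matrix X.
X_toy = sp.csr_matrix(np.array([[0.0, 0.5, 0.0, 0.0],
                                [0.0, 0.0, 0.0, 0.2],
                                [0.1, 0.0, 0.0, 0.0]]))

# toarray() returns a dense numpy.ndarray; from_numpy wraps it without
# going through nested Python lists, and .to() casts float64 to float32.
x_tensor = torch.from_numpy(X_toy.toarray()).to(torch.float32)

print(x_tensor.shape, x_tensor.dtype)
```

The resulting tensor has the same shape and dtype as one built from the list-of-lists column, so the rest of the pipeline would not need to change.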

In [22]:
for i in df['review_vector']:
    print(len(i))

7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221
7221


In [23]:
df['sentiment']

0      1
1      1
2      0
3      0
4      1
      ..
195    1
196    1
197    0
198    0
199    1
Name: sentiment, Length: 200, dtype: int64

##### 4) DataLoader

(1) Load the training data

In [24]:
x_train = torch.tensor(df["review_vector"].values.tolist()).to(torch.float32)
y_train = torch.tensor(df["sentiment"].values).reshape(-1, 1).to(torch.float32)

In [25]:
x_train

tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0543, 0.0290, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0508, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]])

In [26]:
x_train.shape

torch.Size([200, 7221])

In [27]:
y_train

tensor([[1.],
        [1.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [1.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [0.],
        [0.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [1.],
        [0.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [1.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [1.],
        [0.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [1.],
        ...

(2) CustomDataset

In [28]:
class CustomDataset(Dataset):
    def __init__(self, x_data, y_data):
        self.x_data = x_data
        self.y_data = y_data

    def __len__(self):
        return len(self.x_data)
    
    def __getitem__(self, idx):
        # x_data and y_data are already float32 tensors, so index them directly
        x = self.x_data[idx]
        y = self.y_data[idx]

        return x, y
    
dataset = CustomDataset(x_train, y_train)

(3) DataLoader

In [29]:
dataloader = DataLoader(dataset, batch_size = 10, shuffle = True)

In [30]:
dataloader_iter = iter(dataloader)
# Each item from the DataLoader is a (features, labels) pair for one mini-batch
x_batch, y_batch = next(dataloader_iter)

In [31]:
x_batch, y_batch

(tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         ...,
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0391, 0.0000,  ..., 0.0790, 0.0000, 0.0000]]),
 tensor([[0.],
         [0.],
         [1.],
         [1.],
         [0.],
         [0.],
         [0.],
         [0.],
         [0.],
         [0.]]))
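
The `CustomDataset` and `DataLoader` cells above pair each feature row with its label and serve them in shuffled mini-batches. A minimal sketch of the same idea using PyTorch's built-in `TensorDataset`, with random tensors standing in for `x_train` and `y_train`. The names `x_demo`, `y_demo`, and the feature width of 5 are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # make the shuffling reproducible

x_demo = torch.rand(200, 5)                     # 200 samples, 5 features
y_demo = torch.randint(0, 2, (200, 1)).float()  # binary labels, shape (200, 1)

demo_dataset = TensorDataset(x_demo, y_demo)
demo_loader = DataLoader(demo_dataset, batch_size=10, shuffle=True)

x_batch, y_batch = next(iter(demo_loader))
print(len(demo_loader), x_batch.shape, y_batch.shape)
```

With 200 samples and a batch size of 10, the loader yields 20 batches per epoch, which matches the `Batch .../20` counter in the training log.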

##### 5) Define the model (BinaryClassificationModel)

In [32]:
num_features = len(df['review_vector'][0])
num_classes = 1

In [33]:
class BinaryClassificationModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features = num_features, out_features = num_classes)
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x):
        return self.sigmoid(self.linear(x))

model = BinaryClassificationModel()
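
In formula form, the model above is a logistic regression. For one TF-IDF vector $x$ with `num_features` entries, and with $w$ and $b$ denoting the weight and bias of `nn.Linear`, the predicted probability of a positive review is:

$$
\hat{y} = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

The cost used in the training cell, `F.binary_cross_entropy`, averages the following quantity over the $N$ samples of a batch, where $y_i \in \{0, 1\}$ is the `sentiment` label:

$$
\mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
$$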

##### 6) Define the optimizer

In [34]:
optimizer = optim.SGD(model.parameters(), lr=2, momentum=0.9)
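
`optim.SGD` with `momentum=0.9` keeps a running velocity of past gradients instead of stepping along the current gradient alone. A pure-Python sketch of that update rule on a one-dimensional toy problem, assuming the PyTorch formulation with zero dampening and no Nesterov term. The function `sgd_momentum_step`, the toy objective, and the smaller learning rate of 0.1 are illustrative choices, not the notebook's settings:

```python
def sgd_momentum_step(param, grad, velocity, lr, momentum):
    """One SGD-with-momentum update: v <- momentum * v + grad; p <- p - lr * v."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity


# Toy objective f(p) = (p - 3)^2 with gradient 2 * (p - 3); the minimum is at p = 3.
p, v = 0.0, 0.0
for _ in range(300):
    grad = 2.0 * (p - 3.0)
    p, v = sgd_momentum_step(p, grad, v, lr=0.1, momentum=0.9)

print(round(p, 4))
```

Because the velocity accumulates past gradients, the parameter overshoots and oscillates around the minimum before settling. A larger learning rate makes those swings bigger.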

##### 7) Train the model + evaluate performance (assignment)

In [35]:
# (1) Loop over epochs
# Note: range(nb_epochs + 1) runs epochs 0 through 20, i.e. 21 passes over the data
nb_epochs = 20
for epoch in range(nb_epochs + 1):
    for batch_idx, samples in enumerate(dataloader):
        # 1] Unpack the mini-batch (features, labels)
        x_batch, y_batch = samples
        # 2] Forward pass: predicted probabilities
        y_pred = model(x_batch)
        # 3] Compute the cost (binary cross-entropy)
        cost = F.binary_cross_entropy(y_pred, y_batch)

        # 4] Predicted class: 1 if the probability is at least 0.5
        prediction = y_pred >= 0.5
        # 5] Compute the batch accuracy
        check_prediction = prediction.float() == y_batch
        accuracy = check_prediction.sum().item() / len(check_prediction)

        # 6] Backpropagate and update the parameters
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

        # 7] Print the results
        print('Epoch {:4d}/{} Batch {}/{} Cost:{:.6f} Accuracy:{:.6f}'.format(epoch, nb_epochs, batch_idx+1, len(dataloader), cost.item(), accuracy * 100))

Epoch    0/20 Batch 1/20 Cost:0.692325 Accuracy:50.000000
Epoch    0/20 Batch 2/20 Cost:0.691487 Accuracy:50.000000
Epoch    0/20 Batch 3/20 Cost:0.696775 Accuracy:40.000000
Epoch    0/20 Batch 4/20 Cost:0.638821 Accuracy:80.000000
Epoch    0/20 Batch 5/20 Cost:0.797380 Accuracy:50.000000
Epoch    0/20 Batch 6/20 Cost:0.629639 Accuracy:70.000000
Epoch    0/20 Batch 7/20 Cost:0.953918 Accuracy:40.000000
Epoch    0/20 Batch 8/20 Cost:0.748430 Accuracy:40.000000
Epoch    0/20 Batch 9/20 Cost:0.822073 Accuracy:40.000000
Epoch    0/20 Batch 10/20 Cost:1.041032 Accuracy:30.000000
Epoch    0/20 Batch 11/20 Cost:0.631311 Accuracy:60.000000
Epoch    0/20 Batch 12/20 Cost:0.668763 Accuracy:60.000000
Epoch    0/20 Batch 13/20 Cost:0.878964 Accuracy:40.000000
Epoch    0/20 Batch 14/20 Cost:0.743613 Accuracy:50.000000
Epoch    0/20 Batch 15/20 Cost:0.613242 Accuracy:70.000000
Epoch    0/20 Batch 16/20 Cost:0.591020 Accuracy:60.000000
Epoch    0/20 Batch 17/20 Cost:0.694741 Accuracy:50.000000
...
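
The log above reports accuracy per mini-batch of 10 reviews, so the value can only move in steps of 10 percentage points. One way to get a steadier number is to aggregate over a whole pass through the data. A hedged sketch, using random tensors in place of the TF-IDF features and a hypothetical helper named `evaluate`. In practice the loader should be built from held-out reviews, not from the training set:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader):
    """Return (mean loss, accuracy) over every sample served by `loader`."""
    model.eval()
    total_loss, n_correct, n_seen = 0.0, 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for x_batch, y_batch in loader:
            y_prob = model(x_batch)
            # Sum per-sample losses so that batches of any size combine correctly.
            total_loss += F.binary_cross_entropy(y_prob, y_batch, reduction="sum").item()
            n_correct += ((y_prob >= 0.5).float() == y_batch).sum().item()
            n_seen += len(y_batch)
    return total_loss / n_seen, n_correct / n_seen


torch.manual_seed(0)
# Linear layer followed by a sigmoid, with 5 toy features instead of 7221.
demo_model = nn.Sequential(nn.Linear(5, 1), nn.Sigmoid())
demo_loader = DataLoader(
    TensorDataset(torch.rand(40, 5), torch.randint(0, 2, (40, 1)).float()),
    batch_size=10,
)
mean_loss, acc = evaluate(demo_model, demo_loader)
print(f"loss={mean_loss:.4f} accuracy={acc:.2%}")
```

Calling such a helper once per epoch, after the inner batch loop, would give one loss and one accuracy figure per epoch instead of twenty noisy ones.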