<a href="https://colab.research.google.com/github/forrestpark/NLPwithDeepLearning/blob/main/Gender_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparation

Import the necessary libraries: `pandas`, `gensim`, and `torch`.

In [1]:
# for reading a dataset
import pandas as pd

# for pre-training an embedding
import gensim
from gensim.models import FastText

# for building and training neural networks
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.init as I
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split

In [2]:
if torch.cuda.is_available():
    print("using gpu for computation")
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

using gpu for computation


### Hyperparameters

In [3]:
embeddings_dim = 512   # fasttext embedding size
hidden_size = 256
depth = 4

learning_rate = 0.00025

embeddings_epoch = 10 # fasttext training epochs 
num_epochs = 55
dropout = 0.8

n_classes = 2         # F/M
batch_size = 32
print_every = 5

# Corpus

Download the [Gender by Name](https://archive.ics.uci.edu/ml/datasets/Gender+by+Name) dataset.

In [4]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00591/name_gender_dataset.csv

--2021-08-08 08:26:23--  https://archive.ics.uci.edu/ml/machine-learning-databases/00591/name_gender_dataset.csv
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3774591 (3.6M) [application/x-httpd-php]
Saving to: ‘name_gender_dataset.csv’


2021-08-08 08:26:23 (15.9 MB/s) - ‘name_gender_dataset.csv’ saved [3774591/3774591]



We read the CSV file with the `pandas` function `read_csv()`.
Calling `DataFrame.head()` shows that the dataset has four fields: Name, Gender, Count, and Probability. Because every name comes with a gender label, we can train the network in a supervised fashion.

In [5]:
data = pd.read_csv('name_gender_dataset.csv')
data.head()

| | Name | Gender | Count | Probability |
|---|---|---|---|---|
| 0 | James | M | 5304407 | 0.014517 |
| 1 | John | M | 5260831 | 0.014398 |
| 2 | Robert | M | 4970386 | 0.013603 |
| 3 | Michael | M | 4579950 | 0.012534 |
| 4 | William | M | 4226608 | 0.011567 |


To express each name as a vector, we need a word embedding model. The training corpus for the embedding model is a list of sentences, where each sentence is a list of words. Here each name is treated as a sentence of its own.
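
As a minimal sketch of that corpus format, the snippet below wraps a few names in one-token sentences. The sample names come from the `data.head()` output above, and `to_corpus` is a hypothetical helper that is not part of the notebook.

```python
from typing import Iterable, List


def to_corpus(names: Iterable[str]) -> List[List[str]]:
    """Wrap every name in its own one-token sentence."""
    return [[name] for name in names]


corpus = to_corpus(["James", "John", "Robert"])
print(corpus)  # [['James'], ['John'], ['Robert']]
```

The outer list is the corpus and each inner list is a sentence, which is the shape the `sentences` argument of the embedding model expects.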

In [6]:
names = list(data['Name']) # list of names
names = [[name] for name in names] # list of lists

# Pre-training a word embedding

We now train a `FastText` model that embeds each word in the corpus as a 512-dimensional vector (`embeddings_dim`). The training hyperparameters can be changed later based on performance on the validation set; see `help(FastText)` for the available options.
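
FastText builds a word vector from the character n-grams of the word, which is why the `min_n` and `max_n` arguments in the next cell matter for short strings such as names. The sketch below only lists those n-grams in a simplified way. `char_ngrams` is a hypothetical helper, and the real library additionally hashes n-grams into buckets and may treat the boundary markers differently.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 1, max_n: int = 5) -> List[str]:
    """Return all character n-grams of '<word>' with min_n <= n <= max_n."""
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


grams = char_ngrams("Amy", min_n=2, max_n=3)
print(grams)  # ['<A', 'Am', 'my', 'y>', '<Am', 'Amy', 'my>']
```

Because a name that never appeared in training still shares n-grams with names that did, the model can produce a vector for it.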

In [7]:
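# Argument names follow gensim 3.x; gensim 4.0 renamed `size` to `vector_size`
# and `iter` to `epochs`.
embeddings = FastText(sentences=names, sg=1, min_count=1, workers=2,
                      size=embeddings_dim, min_n=1, max_n=5,
                      iter=embeddings_epoch)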

# Building a neural network

The dimension of the word embedding vector (`embeddings_dim`) and the number of classes (`n_classes`) are stored as variables; they determine the sizes of the input layer and the output layer.

Let us start with 256 units per hidden layer (`hidden_size`). This value can be changed later based on performance on the validation set.

Let us build a feed-forward network with `depth` linear layers (4 here):

+  `inp`: input layer --> first hidden layer
+  `mid`: `depth - 2` hidden layer --> hidden layer connections
+  `out`: last hidden layer --> output layer

Every layer except `out` is followed by batch normalization, a ReLU activation, and dropout. The number of hidden layers can be changed later based on performance on the validation set.
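
To get a feel for the size of the `FFNN` class defined in the next cell, the sketch below lists the shape of each linear layer and counts the trainable parameters for the hyperparameters used in this notebook. `layer_shapes` and `count_parameters` are hypothetical helpers. The count assumes one weight matrix and one bias vector per linear layer, plus a scale and a shift vector per batch-normalization layer.

```python
def layer_shapes(depth, input_dim, hidden_dim, output_dim):
    """List (in_features, out_features) for each linear layer, first to last."""
    shapes = [(input_dim, hidden_dim)]
    shapes += [(hidden_dim, hidden_dim)] * (depth - 2)
    shapes.append((hidden_dim, output_dim))
    return shapes


def count_parameters(depth, input_dim, hidden_dim, output_dim):
    """Linear weights and biases plus batch-norm scale and shift vectors."""
    shapes = layer_shapes(depth, input_dim, hidden_dim, output_dim)
    linear = sum(n_in * n_out + n_out for n_in, n_out in shapes)
    batch_norm = (depth - 1) * 2 * hidden_dim
    return linear + batch_norm


shapes = layer_shapes(depth=4, input_dim=512, hidden_dim=256, output_dim=2)
total = count_parameters(depth=4, input_dim=512, hidden_dim=256, output_dim=2)
print(shapes)  # [(512, 256), (256, 256), (256, 256), (256, 2)]
print(total)   # 264962
```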

In [8]:
class FFNN(nn.Module):
    def __init__(self, depth, input_dim, hidden_dim, output_dim, dropout):
        super().__init__()

        self.inp = nn.Linear(input_dim, hidden_dim)
        self.mid = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim)
            for i in range(depth - 2))
        self.out = nn.Linear(hidden_dim, output_dim)
        
        self.norm = nn.ModuleList(
            nn.BatchNorm1d(hidden_dim)
            for i in range(depth - 1))
        
        self.dropout = nn.Dropout(dropout)

        I.kaiming_uniform_(self.inp.weight, nonlinearity="relu")
        for layer in self.mid:
            I.kaiming_uniform_(layer.weight, nonlinearity="relu")
        I.xavier_uniform_(self.out.weight)

    def forward(self, x: torch.Tensor):
        x = self.dropout(F.relu(self.norm[0](self.inp(x))))
        for mid, norm in zip(self.mid, self.norm[1:]):
            x = self.dropout(F.relu(norm(mid(x))))
        x = self.out(x)

        return x


Instantiate the network and store it in the variable `net`.

In [9]:
net = FFNN(depth=depth,
           input_dim=embeddings_dim,
           hidden_dim=hidden_size,
           output_dim=n_classes,
           dropout=dropout)
net.to(device)
print(net)

FFNN(
  (inp): Linear(in_features=512, out_features=256, bias=True)
  (mid): ModuleList(
    (0): Linear(in_features=256, out_features=256, bias=True)
    (1): Linear(in_features=256, out_features=256, bias=True)
  )
  (out): Linear(in_features=256, out_features=2, bias=True)
  (norm): ModuleList(
    (0): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (dropout): Dropout(p=0.8, inplace=False)
)


Let us actually feed an input vector into the network and get an output vector.

###  Creating an input word vector for the name 'Cyrill'

In [11]:
input = torch.tensor(embeddings.wv['Cyrill'], device=device)  # embedding of the example name

The input consists of a single 512-dimensional word vector, so the input vector has 512 dimensions.

In [12]:
print(input.size())

torch.Size([512])


The set of classes is {F, M}, so the output vector has two dimensions.

In [13]:
net.eval()
output = net(input.unsqueeze(0))
print(output.squeeze().size())

torch.Size([2])


# Training, validation & test datasets

Let us build a dataset suited to the model's objective. We create a new class that inherits from `torch.utils.data.Dataset` and implement the `__len__()` and `__getitem__()` methods.

The class below precomputes the embeddings and labels in `__init__()`, and `__getitem__()` returns one (vector, label) pair. Labels are encoded as 0 for female and 1 for male.
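
The label encoding can be sketched on its own as follows. The dataset class in the next cell maps every label other than "F" to 1; the variant here is an assumption on my part that uses an explicit lookup table so an unexpected label raises an error instead. `LABEL_TO_INDEX`, `INDEX_TO_LABEL`, and `encode_label` are hypothetical names.

```python
LABEL_TO_INDEX = {"F": 0, "M": 1}
INDEX_TO_LABEL = {index: label for label, index in LABEL_TO_INDEX.items()}


def encode_label(label: str) -> int:
    """Map 'F'/'M' to 0/1 and fail loudly on anything else."""
    try:
        return LABEL_TO_INDEX[label]
    except KeyError:
        raise ValueError(f"unexpected gender label: {label!r}") from None


encoded = [encode_label(label) for label in ["M", "F", "M"]]
print(encoded)  # [1, 0, 1]
```

The inverse table `INDEX_TO_LABEL` is handy at prediction time, when the index of the largest output has to be turned back into a letter.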

In [14]:
class NameDataset(Dataset):
    def __init__(self, data, embeddings):
        # Look names up in the `data` argument rather than the global `names` list
        self.names = torch.tensor([embeddings.wv[name] for name in data['Name']],
                                  device=device)
        self.labels = torch.tensor(
            [0 if label == "F" else 1 for label in data['Gender']],
            dtype=torch.long,
            device=device)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        return self.names[idx], self.labels[idx]

Building the dataset from the corpus as described above yields 147,269 examples. Each example consists of a 512-dimensional input vector and one answer label.

In [15]:
dataset = NameDataset(data=data, embeddings=embeddings)
print(len(dataset))
print(dataset[0][:10])

147269
(tensor([-1.7508e-04, -5.8746e-05, -1.5219e-04,  8.0608e-05, -1.2216e-05,
        -1.7953e-04, -1.4539e-04, -3.3422e-04,  2.8326e-04, -3.2299e-05,
        -8.7766e-05, -1.1470e-04, -7.3571e-05, -5.8004e-05, -2.1175e-04,
        -7.6971e-06, -3.3010e-05, -6.7060e-06, -9.6139e-05, -8.8557e-05,
         3.2708e-05, -1.9642e-04, -2.3934e-04,  1.5144e-04, -9.0619e-06,
        -2.4235e-04,  2.7663e-04,  3.8188e-04, -2.1025e-04, -7.1081e-06,
        -6.3145e-05, -1.8900e-04,  4.0729e-04,  1.4545e-04, -1.9654e-04,
        -6.0527e-04,  2.5291e-04,  1.2834e-04, -2.1131e-04,  3.1024e-04,
        -9.7725e-05,  1.2811e-04,  1.2853e-04, -1.1167e-04, -3.0214e-04,
         8.2110e-05,  2.3547e-04, -6.6721e-05,  1.8263e-05,  9.0922e-05,
        -1.9627e-04,  4.0097e-05,  8.6036e-05,  1.8092e-04,  7.6385e-05,
         8.7623e-05, -1.3785e-04, -7.5645e-05,  3.7484e-04,  2.2766e-04,
         6.3754e-05, -1.6311e-04,  3.3755e-04,  2.6203e-04, -1.4301e-04,
         1.3966e-04,  1.9825e-04, -2.8852e-

The code below

1. uses the batch size of 32 set in the hyperparameters,
2. splits the data with `torch.utils.data.random_split()` into a training set (120,000 entries), a validation set (10,000 entries), and a test set (the rest),
3. creates a `DataLoader()` object for each set that reads one batch at a time.
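
The arithmetic behind the split can be checked without PyTorch. The sketch below computes the three set sizes and the number of batches per epoch, assuming the last, smaller batch is kept (which, as far as I know, is the `DataLoader` default). `split_sizes` and `batches_per_epoch` are hypothetical helpers.

```python
import math


def split_sizes(total: int, n_train: int, n_valid: int) -> list:
    """Return [train, valid, test] lengths; the test set takes the remainder."""
    n_test = total - n_train - n_valid
    if n_test < 0:
        raise ValueError("train and validation sizes exceed the dataset size")
    return [n_train, n_valid, n_test]


def batches_per_epoch(n_examples: int, batch_size: int) -> int:
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)


sizes = split_sizes(total=147269, n_train=120000, n_valid=10000)
batch_counts = [batches_per_epoch(n, 32) for n in sizes]
print(sizes)         # [120000, 10000, 17269]
print(batch_counts)  # [3750, 313, 540]
```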



In [16]:
train_dataset, valid_dataset, test_dataset = random_split(
    dataset=dataset,
    lengths=[120000, 10000, len(dataset)-130000],
    generator=torch.Generator().manual_seed(42)
    )
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(dataset=valid_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

# Training & evaluating the neural network

##  Loss function

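To train the network we use the cross-entropy loss function. Internally it applies the softmax activation first and then computes L_CE. The cell below is left commented out because the training loop calls `F.cross_entropy` directly.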
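
For a single example, the softmax followed by cross entropy can be written out in plain Python. This is a minimal sketch of the formula and not the PyTorch implementation; `softmax` and `cross_entropy` are hypothetical helpers that work on a list of raw scores.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(logits, target):
    """Negative log-probability that softmax assigns to the target class."""
    return -math.log(softmax(logits)[target])


chance_loss = cross_entropy([0.0, 0.0], target=1)
print(round(chance_loss, 4))  # 0.6931
```

With two classes and equal scores the loss is ln 2, about 0.693, which is a useful reference point when reading the loss values printed during training.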

In [None]:
# criterion = nn.CrossEntropyLoss()

## Optimizer

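We use Adam as the optimizer with an initial learning rate of 0.00025 (`learning_rate`). A `ReduceLROnPlateau` scheduler lowers the rate when the validation loss stops improving for more than 5 epochs. The optimizer and the learning rate can be changed later based on performance on the validation set.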
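
The `ReduceLROnPlateau` scheduler in the next cell cuts the learning rate when the monitored loss stops improving. The class below is a simplified stand-in for that rule and not the PyTorch implementation: it assumes the default `factor=0.1` and a relative `threshold=1e-4`, and it leaves out cooldown and the minimum learning rate. `PlateauScheduler` is a hypothetical name.

```python
class PlateauScheduler:
    """Simplified reduce-on-plateau rule (no cooldown, no minimum rate)."""

    def __init__(self, lr, patience=5, factor=0.1, threshold=1e-4):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.threshold = threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        # An epoch counts as an improvement only if it beats the best loss
        # by more than a small relative margin.
        if loss < self.best * (1 - self.threshold):
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Cut the rate once more than `patience` epochs pass without improvement.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


scheduler_sketch = PlateauScheduler(lr=0.00025, patience=2)
history = [scheduler_sketch.step(loss) for loss in [0.60, 0.50, 0.50, 0.50, 0.50]]
print(history)
```

In this run the rate stays at 0.00025 for the first four epochs and drops by a factor of ten on the fifth, after three epochs without improvement.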

In [17]:
optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5,
                                                 verbose=True)

## Training & Evaluating

Using the cross-entropy loss function and the optimizer, we update the network's weights on the training set and check performance by accuracy on the validation set. The number of epochs (`num_epochs`) can be changed later based on performance on the validation set.
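
The training and validation functions below call `F.cross_entropy` with `reduction="sum"` and divide the running total by `len(loader.dataset)`. The sketch below illustrates why that gives the exact per-example average even when the final batch is shorter, whereas averaging the batch means would not. The loss values are made up for illustration, and both functions are hypothetical helpers.

```python
def average_from_sums(batches):
    """Sum every per-example loss, then divide by the number of examples."""
    total = sum(sum(batch) for batch in batches)
    count = sum(len(batch) for batch in batches)
    return total / count


def average_of_batch_means(batches):
    """Average the per-batch means, which over-weights a short final batch."""
    return sum(sum(batch) / len(batch) for batch in batches) / len(batches)


# Hypothetical per-example losses: two full batches and one short final batch.
batches = [[1.0] * 4, [1.0] * 4, [3.0]]
exact = average_from_sums(batches)          # 11 / 9
skewed = average_of_batch_means(batches)    # 5 / 3
print(exact, skewed)
```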

###  Writing functions that train the network and calculate accuracy on the validation set

In [18]:
def train(net, optimizer, loader, epoch):
    net.train()
    correct = sum_loss = 0

    for inputs, labels in loader:
        optimizer.zero_grad()

        outputs = net(inputs)
        labels = labels.to(outputs.device)

        # Calculate loss: softmax --> cross entropy, summed over the batch
        loss = F.cross_entropy(outputs, labels, reduction="sum")
        # Detach so the running total does not keep each batch's graph alive
        sum_loss += loss.detach()

        # Backpropagate, then update the parameters
        loss.backward()
        optimizer.step()

        predict = outputs.argmax(-1) == labels
        correct += predict.sum()

    avg_loss = sum_loss / len(loader.dataset)
    accuracy = correct / len(loader.dataset)

    if epoch == 1 or not epoch % print_every:
        print(
            f"Epoch: {epoch}. Train Loss: {avg_loss.item()}. "
            f"Train Accuracy: {accuracy:.2%}")


def validate(net, loader, scheduler, eval_type, epoch):
    correct = loss = 0
    with torch.no_grad():
        net.eval()

        for (inputs, labels) in loader:
            outputs = net(inputs)
            labels = labels.to(outputs.device)

            loss += F.cross_entropy(outputs, labels, reduction="sum")
            predict = outputs.argmax(-1) == labels
            correct += sum(predict)

        loss = loss / len(loader.dataset)
        accuracy = correct / len(loader.dataset)
        if scheduler:
            scheduler.step(loss)

    if epoch == 1 or not epoch % print_every:
        print(
            f"Epoch: {epoch}. {eval_type} Loss: {loss.item()}. "
            f"{eval_type} Accuracy: {accuracy:.2%}")

In [19]:
for epoch in range(1, num_epochs + 1):
    train(net, optimizer, train_loader, epoch)
    validate(net, valid_loader, scheduler, "Validation", epoch)

Epoch: 1. Train Loss: 0.7710070013999939. Train Accuracy: 58.23%
Epoch: 1. Validation Loss: 0.6138774752616882. Validation Accuracy: 65.23%
Epoch: 5. Train Loss: 0.5158765912055969. Train Accuracy: 75.33%
Epoch: 5. Validation Loss: 0.4951203465461731. Validation Accuracy: 76.07%
Epoch: 10. Train Loss: 0.49089348316192627. Train Accuracy: 76.98%
Epoch: 10. Validation Loss: 0.48478788137435913. Validation Accuracy: 76.76%
Epoch: 15. Train Loss: 0.47643008828163147. Train Accuracy: 77.98%
Epoch: 15. Validation Loss: 0.4685654044151306. Validation Accuracy: 78.19%
Epoch: 20. Train Loss: 0.4668131172657013. Train Accuracy: 78.66%
Epoch: 20. Validation Loss: 0.4639536440372467. Validation Accuracy: 78.05%
Epoch: 25. Train Loss: 0.46067723631858826. Train Accuracy: 78.89%
Epoch: 25. Validation Loss: 0.46031635999679565. Validation Accuracy: 78.47%
Epoch: 30. Train Loss: 0.4529385566711426. Train Accuracy: 79.34%
Epoch: 30. Validation Loss: 0.4830315411090851. Validation Accuracy: 76.53%
Epoch

Calculating accuracy on the test set.

In [20]:
validate(net, test_loader, None, "Test", epoch)

Epoch: 55. Test Loss: 0.44618508219718933. Test Accuracy: 79.28%


### Fine-tuning and optimizing the model for maximum accuracy by adjusting hyperparameters

# Predicting the gender

Let us check whether the neural network also works for names that are not even in the test set. For example, the name "Shrek" does not appear in the data.

In [21]:
print(['Shrek'] in names)

False


In [22]:
net.eval()
input = torch.tensor(embeddings.wv['Shrek'], device=device)
output = net(input.unsqueeze(0))
"M" if output.squeeze().argmax().item() else "F"

'M'

### Let us see which gender the model predicts for the name "Shica".

In [23]:
input = torch.tensor(embeddings.wv['Shica'], device=device)
output = net(input.unsqueeze(0))
"M" if output.squeeze().argmax().item() else "F"

'F'