<a href="https://colab.research.google.com/github/dvircohen0/Facial-recognition-system/blob/main/single_label_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Download train/test datasets for single-label text categorization

In [None]:
!wget https://www.cs.umb.edu/~smimarog/textmining/datasets/r8-train-all-terms.txt
!wget https://www.cs.umb.edu/~smimarog/textmining/datasets/r8-test-all-terms.txt

Download the GloVe model from Kaggle

In [8]:
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d watts2/glove6b50dtxt
!unzip /content/glove6b50dtxt.zip

Saving kaggle.json to kaggle (1).json
glove6b50dtxt.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  /content/glove6b50dtxt.zip
replace glove.6B.50d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: n


In [9]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

Read the data as pandas DataFrames and label the columns

In [14]:
train = pd.read_csv('r8-train-all-terms.txt', header=None, sep='\t')
test = pd.read_csv('r8-test-all-terms.txt', header=None, sep='\t')
train.columns = ['label', 'content']
test.columns = ['label', 'content']
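
The column names can also be supplied in the read itself instead of being assigned afterwards. A minimal sketch, assuming the same headerless, tab-separated label/content layout; the read_r8 helper and the in-memory sample rows are hypothetical and only for illustration:

```python
import io
import pandas as pd

def read_r8(source):
    # Hypothetical helper: read a headerless, tab-separated file
    # and name its two columns in one step.
    return pd.read_csv(source, header=None, sep='\t', names=['label', 'content'])

# Made-up sample in the same label<TAB>content layout.
sample = io.StringIO(
    "earn\tprofit rose in the quarter\n"
    "acq\tcompany agreed to buy rival\n"
)
df = read_r8(sample)
print(df.label.value_counts())
```

The same helper could be called with 'r8-train-all-terms.txt' or 'r8-test-all-terms.txt' in place of the in-memory sample.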

Define the GloveVectorizer using the GloVe model.
We define the init, fit, and transform methods so that the vectorizer can be applied to any text data.
Each document is represented by the mean of the GloVe vectors of its words.

In [18]:
class GloveVectorizer:
  def __init__(self):
    # load in pre-trained word vectors
    print('Loading word vectors...')
    word2vec = {}
    embedding = []
    idx2word = []
    with open('glove.6B.50d.txt',encoding="utf8") as f:
      # the GloVe file is a space-separated text file, one word per line:
      # word vec[0] vec[1] vec[2] ...
      for line in f:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:], dtype='float32')
        word2vec[word] = vec
        embedding.append(vec)
        idx2word.append(word)
    print('Found %s word vectors.' % len(word2vec))

    self.word2vec = word2vec
    self.embedding = np.array(embedding)
    self.word2idx = {v:k for k,v in enumerate(idx2word)}
    self.V, self.D = self.embedding.shape

  def fit(self, data):
    # nothing to learn here: the word vectors are pre-trained
    pass

  def transform(self, data):
    X = np.zeros((len(data), self.D))
    n = 0
    emptycount = 0
    for sentence in data:
      tokens = sentence.lower().split()
      vecs = []
      for word in tokens:
        if word in self.word2vec:
          vec = self.word2vec[word]
          vecs.append(vec)
      if len(vecs) > 0:
        vecs = np.array(vecs)
        X[n] = vecs.mean(axis=0)
      else:
        emptycount += 1
      n += 1
    print("Numer of samples with no words found: %s / %s" % (emptycount, len(data)))
    return X

  def fit_transform(self, data):
    self.fit(data)
    return self.transform(data)
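
The averaging step in transform can be checked by hand on a tiny vocabulary. A minimal sketch, assuming made-up 3-dimensional vectors; toy_word2vec and average_vector are hypothetical names and are not part of the class above:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
toy_word2vec = {
    'oil':   np.array([1.0, 0.0, 2.0], dtype='float32'),
    'price': np.array([3.0, 2.0, 0.0], dtype='float32'),
}

def average_vector(sentence, word2vec, dim):
    # Keep only the tokens that have an embedding, then average them.
    vecs = [word2vec[w] for w in sentence.lower().split() if w in word2vec]
    if not vecs:
        # No known tokens: fall back to the zero vector, as transform does.
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Unknown tokens are skipped, so this is the mean of the two known vectors.
print(average_vector('Oil price unknownword', toy_word2vec, 3))
# No token is in the vocabulary, so the result is all zeros.
print(average_vector('unknownword', toy_word2vec, 3))
```

Averaging discards word order, which keeps every document at a fixed length of D features regardless of how many words it contains.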

Create a GloveVectorizer object and use it to transform the training and test texts into averaged GloVe vectors

In [19]:
vectorizer = GloveVectorizer()
Xtrain = vectorizer.fit_transform(train.content)
Ytrain = train.label

Xtest = vectorizer.transform(test.content)
Ytest = test.label

Loading word vectors...
Found 400000 word vectors.
Numer of samples with no words found: 0 / 5485
Numer of samples with no words found: 0 / 2189


Create the random forest model, train it, and print the train and test accuracy

In [20]:
model = RandomForestClassifier(n_estimators=200)
model.fit(Xtrain, Ytrain)
print("train score:", model.score(Xtrain, Ytrain))
print("test score:", model.score(Xtest, Ytest))

train score: 0.9992707383773929
test score: 0.9328460484239379
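
For a classifier, score returns the mean accuracy, so the two numbers above are the fractions of correctly labeled training and test documents. A small sketch on synthetic data that checks this against accuracy_score; the random features and labels are made up, and random_state is fixed only for repeatability:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Synthetic stand-in for document vectors: 200 samples, 5 features.
X = rng.normal(size=(200, 5))
# The label depends only on the sign of the first feature.
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X[:150], y[:150])

# score() on held-out rows versus accuracy computed from the predictions.
test_score = clf.score(X[150:], y[150:])
manual_accuracy = accuracy_score(y[150:], clf.predict(X[150:]))
print(test_score, manual_accuracy)
```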
