In this naive test approach we:
- Represent every word of the keywords and categories as a vector.
- Sum the word vectors for each category.
- Sum the word vectors of all keywords over the channel. (Simple summing is probably not the best choice; I will investigate and test this part later.)
- Search for the k nearest neighbors using cosine distance. (For the tests we use the brute-force NearestNeighbors algorithm; it should be replaced with a Locality-Sensitive Hashing nearest-neighbors search, which should be a better choice for this task in terms of performance.)
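
The steps above can be sketched on toy data without spaCy or scikit-learn. This is only an illustration: the three-dimensional vectors in `toy_vectors` are invented, and `cosine_similarity` and `top_categories` are hypothetical helper names, not part of the classifier defined in this notebook.

```python
import numpy as np

# Invented 3-dimensional "word vectors"; real ones would come from spaCy.
toy_vectors = {
    "cars":   np.array([1.0, 0.1, 0.0]),
    "fruits": np.array([0.0, 1.0, 0.1]),
    "bmw":    np.array([0.9, 0.0, 0.1]),
    "wheel":  np.array([0.8, 0.2, 0.0]),
    "apple":  np.array([0.1, 0.9, 0.0]),
}


def cosine_similarity(a, b):
    # Cosine similarity = 1 - cosine distance.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_categories(keywords, categories, top_n):
    # Sum all keyword vectors into a single vector.
    keywords_vector = np.sum([toy_vectors[w] for w in keywords], axis=0)
    # Brute-force search: score every category, keep the top_n most similar.
    scored = [(c, cosine_similarity(keywords_vector, toy_vectors[c]))
              for c in categories]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


result = top_categories(["bmw", "wheel", "apple"], ["cars", "fruits"], top_n=1)
print(result)
```

Each category here is a single word, so the per-category sum is trivial; a multi-word category would be summed the same way as the keywords.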

Let's define a simple class that performs this classification:

In [41]:
import numpy
from sklearn.neighbors import NearestNeighbors
import spacy


class KeywordsClassifier(object):

    def __init__(self, nearestNeighborsLearner, categories, topN):
        self.nearestNeighborsLearner = nearestNeighborsLearner 
        self.categories = categories 
        self.topN = topN 

    @classmethod
    def CreateKeywordsClassifier(cls, nlp, categories, topNCount):
        '''
        @param nlp: spaCy dictionary
        @param categories: list of categories
        @param topNCount: number of categories to choose
        @return: instance of KeywordsClassifier class
        '''
        catDic = KeywordsClassifier.GetVerticalsVectorsDict(nlp, categories)
        catArray = numpy.array(list(catDic.values()))
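        # n_neighbors is passed by keyword: recent scikit-learn versions do not accept it positionally
        nn = NearestNeighbors(n_neighbors=topNCount, algorithm='brute', metric='cosine') # for production we should use LSHForest instead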
        nn.fit(catArray)
        return KeywordsClassifier(nn, list(catDic.keys()), topNCount)

    @staticmethod
    def GetVerticalsVectorsDict(nlp, verticals):
        verticalsDict = {}

        for category in verticals:
            text = nlp(category.replace("&", ""))
            verticalsDict[category] = text.vector

        return verticalsDict


    def ClassifyKeywords(self, keywordsVector):
        '''
        @param keywordsVector: resulting keywords vector
        @return: list of (category name, cosine similarity) tuples for the top N categories, most similar first
        '''
        # kneighbors expects a 2D array with one row per query vector
        distances, indices = self.nearestNeighborsLearner.kneighbors(keywordsVector.reshape(1, -1))

        result = []
        for curIndex, cosDistance in zip(indices[0], distances[0]):
            cosineSim = 1 - cosDistance  # cosine similarity = 1 - cosine distance
            result.append((self.categories[curIndex], cosineSim))

        return result

Before we start, we need to load the spaCy dictionary:

In [42]:
if 'nlp' not in globals(): # to avoid loading the huge spaCy dictionary several times
    print("Loading dictionary...")
    nlp = spacy.load('en')  # 'en' is a shortcut from older spaCy releases; spaCy 3 needs a full model name
    print("Dictionary loaded.")

A simple helper function:

In [43]:
def ClassifyIt(categories, keywords, topN):
    '''
    @param categories: list of categories
    @param keywords: a single string that contains keywords
    @param topN: number of categories to choose
    '''
    keywordsVect = nlp(keywords).vector
    classifier = KeywordsClassifier.CreateKeywordsClassifier(nlp, categories, topN)
    results = classifier.ClassifyKeywords(keywordsVect)
    print(results)

Now we can play with some toy examples and see the results:

In [44]:
categories = ['fruits', 'cars', 'motorcycles', 'animals', 'people', 'politics', 'design']
keywords = 'BMW, repair, NISSAN, wheel, TOYOTA, road signs'
topN = 1 # number of categories to choose

ClassifyIt(categories, keywords, topN)

[('cars', 0.62240266799926758)]


The answer is correct: cars. Note that we have two similar categories, motorcycles and cars (some of the brands among the keywords above produce both cars and motorcycles), but that was not a problem. Let's change the keywords a little, as if they came from the motorcycles group, and see whether we can still classify them correctly:

In [45]:
keywords = 'BMW, SUZUKI, Gixer, Harley-Davidson, road signs'
ClassifyIt(categories, keywords, 1)


[('motorcycles', 0.57702171802520752)]


Correct.
Let's try a mixed classification:


In [49]:
keywords = 'BMW, cherry, NISSAN, banana, TOYOTA, apple, road signs'
topN = 2 # number of categories to choose
ClassifyIt(categories, keywords, topN)

[('cars', 0.53120368719100952), ('fruits', 0.41091710329055786)]


Now a real, problematic example: the MLB channel (https://www.youtube.com/user/MLB/), the largest YouTube channel in our database, with more than 3M keywords. As categories we will use the verticals from Google AdWords, and as keywords the tags of all videos.

In [63]:
import csv
from collections import Counter

def GetLargestKeywordsDb():
    sumVector = numpy.zeros(nlp.vocab.vectors_length)

    keywords = []
    print("Loading keywords...")
    with open('235227_keywords.csv', 'rt') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for row in reader:
            keywords.append(row[0])

    print("Calculating vectors sum...")
    counterDict = Counter(keywords) # optimization: vectorize each distinct keyword only once
    for word, repCount in counterDict.items():  # sum word vectors, weighted by repetition count
        curVect = nlp(word).vector
        sumVector += (curVect * repCount)

    return sumVector

def GetVerticalsListFromFile():
    verticals = []
    with open('verticals.csv', 'rt') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        next(reader, None)  # skip the headers
        for row in reader:
            verticals.append(row[0].replace("&", ""))
    return verticals

if 'keywordsVectL' not in globals(): 
    keywordsVectL = GetLargestKeywordsDb()

categoriesL = GetVerticalsListFromFile()
    
classifier = KeywordsClassifier.CreateKeywordsClassifier(nlp, categoriesL, 4)
results = classifier.ClassifyKeywords(keywordsVectL)
print(results)

[('American Football', 0.77245274247048656), ('American Football Equipment', 0.74987778186729015), ('Chicago', 0.72816600489182681), ('Baseball', 0.7276862459056388)]


We see that the correct answer, Baseball, is only in 4th place. There can be several reasons for that, and this needs further investigation.

For some of the channels I have tested, this classification approach works correctly in general; for others it does not.
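
One thing that could be checked, given the earlier doubt about plain summing, is whether a few keywords with large vector norms dominate the sum. The sketch below is only a hypothesis to test, not a confirmed cause of the MLB result. The two-dimensional vectors are invented, and `combine` and `cosine_similarity` are hypothetical helpers. It also shows that replacing the sum with a mean changes nothing for cosine similarity, while normalizing each word vector to unit length before summing does.

```python
from collections import Counter

import numpy as np


def combine(keywords, vectors, normalize):
    """Frequency-weighted sum of keyword vectors, optionally unit-normalized first."""
    total = np.zeros(len(next(iter(vectors.values()))))
    for word, count in Counter(keywords).items():
        vec = vectors[word]
        if normalize:
            vec = vec / np.linalg.norm(vec)
        total += vec * count
    return total


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Invented vectors: "loud" has a large norm, "quiet" a small one.
vectors = {"loud": np.array([10.0, 0.0]), "quiet": np.array([0.0, 1.0])}
keywords = ["loud", "quiet", "quiet", "quiet"]
target = np.array([0.0, 1.0])  # stands in for the "quiet" category

summed = combine(keywords, vectors, normalize=False)
plain = cosine_similarity(summed, target)
mean_based = cosine_similarity(summed / len(keywords), target)
normalized = cosine_similarity(combine(keywords, vectors, normalize=True), target)

print(plain, mean_based, normalized)
```

In this toy case the single large-norm keyword outweighs three occurrences of the other one in the plain sum, while the normalized variant follows the keyword frequencies. Whether real spaCy vectors behave this way on the channel data would have to be measured.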