# ACM categories

This tutorial shows how ACM categories can be studied with Gismo.

Imagine that you want to submit an article and are asked to provide an ACM category and some generic keywords. Let see how Gismo can help you.

Here, *documents* are ACM categories. The *features* of a category will be the words of its name along with the words of the name of its descendants.

## Initialisation

First, we load the required package.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from gismo.datasets.acm import get_acm, flatten_acm
from gismo.corpus import Corpus
from gismo.embedding import Embedding
from gismo.gismo import Gismo
from gismo.post_processing import print_feature_cluster

Then, we load the ACM source. Note that we flatten the source, i.e. the existing hierarchy is discarded, as Gismo will provide its own dynamic, query-based, structure.

In [2]:
acm = flatten_acm(get_acm())

Each category in the ``acm`` list is a dict with ``name`` and ``query``. We build a corpus that will tell Gismo that the content of a category is its ``query`` value.

In [3]:
corpus = Corpus(acm, to_text=lambda x: x['query'])

We build an embedding on top of that corpus.
- We set ``min_df=3`` to exclude rare features;
- We set ``ngram_range=[1, 3]`` to include bi-grams and tri-grams in the embedding.
- We manually pick a few common words to exclude from the emebedding.

In [4]:
vectorizer = CountVectorizer(min_df=3, ngram_range=[1, 3], dtype=float, stop_words=['to', 'and'])
embedding = Embedding(vectorizer=vectorizer)
embedding.fit_transform(corpus)

In [5]:
embedding.x

<234x6936 sparse matrix of type '<class 'numpy.float64'>'
	with 28041 stored elements in Compressed Sparse Row format>

We see from ``embedding.x`` that the embedding links 234 documents to 6,936 features. There are 28,041 weights: in average, each document is linked to more than 100 features, each feature is linked to 4 documents.

Now, we initiate the gismo object, and customize post_processers to ease the display.

In [6]:
gismo = Gismo(corpus, embedding)
gismo.post_document = lambda g, i: g.corpus[i]['name']
def print_document_cluster(gismo, cluster, depth=""):
    sim = cosine_similarity(cluster.vector, gismo.diteration.y_relevance.reshape(1, -1))[0][0]
    if len(cluster.children) == 0:
        txt = gismo.corpus[cluster.indice]['name']
        print(f"{depth} {txt} "
              f"(R: {gismo.diteration.x_relevance[cluster.indice]:.2f}; "
              f"S: {sim:.2f})")
    else:
        print(f"{depth} F: {cluster.focus:.2f}. "
              f"R: {sum(gismo.diteration.x_relevance[cluster.members]):.2f}. "
              f"S: {sim:.2f}.")
    for c in cluster.children:
        print_document_cluster(gismo, c, depth=depth + '-')
gismo.post_document_cluster = print_document_cluster
gismo.post_feature_cluster = print_feature_cluster

## Machine Learning query

We perform the query *Machine learning*. The returned ``True`` tells that some of the query features were found in the corpus' features.

**Remark:** For this tutorial, we just enter a few words, but at the start of this Notebook, we talked about submitting an article. As a query can be as long as you want, you can call the ``rank`` method with the full textual content of your article if you want to.

In [7]:
gismo.rank("Machine learning")

True

What are the best ACM categories for an article on *Machine Learning*?

In [8]:
gismo.get_ranked_documents()

['Machine learning',
 'Computing methodologies',
 'Machine learning algorithms',
 'Learning paradigms',
 'Machine learning theory',
 'Machine learning approaches',
 'Theory and algorithms for application domains',
 'Theory of computation',
 'Natural language processing',
 'Artificial intelligence']

Sounds nice. How are these domains related in the context of *Machine Learning*?

In [9]:
gismo.get_clustered_ranked_documents()

 F: 0.06. R: 0.52. S: 0.75.
- F: 0.63. R: 0.48. S: 0.73.
-- F: 0.78. R: 0.41. S: 0.70.
--- F: 0.98. R: 0.16. S: 0.85.
---- Machine learning (R: 0.09; S: 0.84)
---- Computing methodologies (R: 0.06; S: 0.87)
--- Learning paradigms (R: 0.06; S: 0.62)
--- F: 0.94. R: 0.14. S: 0.63.
---- Machine learning theory (R: 0.06; S: 0.61)
---- Theory and algorithms for application domains (R: 0.05; S: 0.63)
---- Theory of computation (R: 0.04; S: 0.66)
--- Machine learning approaches (R: 0.05; S: 0.54)
-- Machine learning algorithms (R: 0.06; S: 0.60)
- F: 0.66. R: 0.04. S: 0.23.
-- Natural language processing (R: 0.03; S: 0.21)
-- Artificial intelligence (R: 0.02; S: 0.30)


OK! Let's decode this:
- Mainstream we have two main groups
    - the practical fields (methodology, paradigms)
    - the theoretical fields
- If you don't want to decide, you can go with approaches/algorithms.
- But maybe your article uses machine learning to achieve NLP or AI?

Now, let's look at the main keywords.

In [10]:
gismo.get_ranked_features()

['learning',
 'reinforcement',
 'reinforcement learning',
 'decision',
 'machine',
 'supervised learning',
 'supervised',
 'iteration',
 'learning learning',
 'machine learning']

Let's organize them.

In [11]:
gismo.get_clustered_ranked_features()

 F: 0.62. R: 0.01. S: 0.93.
- F: 0.84. R: 0.01. S: 0.92.
-- F: 0.87. R: 0.01. S: 0.92.
--- learning (R: 0.00; S: 0.96)
--- reinforcement (R: 0.00; S: 0.83)
--- reinforcement learning (R: 0.00; S: 0.83)
--- decision (R: 0.00; S: 0.96)
--- machine (R: 0.00; S: 0.95)
--- supervised learning (R: 0.00; S: 0.81)
--- supervised (R: 0.00; S: 0.81)
--- machine learning (R: 0.00; S: 0.93)
-- learning learning (R: 0.00; S: 0.75)
- iteration (R: 0.00; S: 0.68)


Hum, not very informative. Let's increase the resolution to get more structure!

In [12]:
gismo.get_clustered_ranked_features(resolution=.97)

 F: 0.62. R: 0.01. S: 0.93.
- F: 0.84. R: 0.01. S: 0.92.
-- F: 0.87. R: 0.01. S: 0.92.
--- F: 0.96. R: 0.01. S: 0.96.
---- learning (R: 0.00; S: 0.96)
---- decision (R: 0.00; S: 0.96)
---- machine (R: 0.00; S: 0.95)
---- machine learning (R: 0.00; S: 0.93)
--- F: 0.92. R: 0.00. S: 0.84.
---- F: 1.00. R: 0.00. S: 0.83.
----- reinforcement (R: 0.00; S: 0.83)
----- reinforcement learning (R: 0.00; S: 0.83)
---- F: 1.00. R: 0.00. S: 0.81.
----- supervised learning (R: 0.00; S: 0.81)
----- supervised (R: 0.00; S: 0.81)
-- learning learning (R: 0.00; S: 0.75)
- iteration (R: 0.00; S: 0.68)


Rough analysis:
- Machine learning is about... Machine learning, which seems related to decision.
- Reinforcement learning and supervised learning seem to be categories of interest.
- Iteration is a lone wolf. Rather important in the context of ML, but not highly related to other keywords so set apart.

## P2P query

We perform the query *P2P*. The returned ``False`` tells that P2P is not a feature of the corpus (it's a small corpus after all, made only of catagory titles).

In [13]:
gismo.rank("P2P")

False

Let's try to avoid the acronym. Ok, now it works.

In [14]:
gismo.rank("Machine learning")

True

What are the best ACM categories for an article on *P2P*?

In [15]:
gismo.get_ranked_documents()

['Machine learning',
 'Computing methodologies',
 'Machine learning algorithms',
 'Learning paradigms',
 'Machine learning theory',
 'Machine learning approaches',
 'Theory and algorithms for application domains',
 'Theory of computation',
 'Natural language processing',
 'Artificial intelligence']

Sounds nice. How are these domains related in the context of *P2P*?

In [16]:
gismo.get_clustered_ranked_documents()

 F: 0.06. R: 0.52. S: 0.75.
- F: 0.63. R: 0.48. S: 0.73.
-- F: 0.78. R: 0.41. S: 0.70.
--- F: 0.98. R: 0.16. S: 0.85.
---- Machine learning (R: 0.09; S: 0.84)
---- Computing methodologies (R: 0.06; S: 0.87)
--- Learning paradigms (R: 0.06; S: 0.62)
--- F: 0.94. R: 0.14. S: 0.63.
---- Machine learning theory (R: 0.06; S: 0.61)
---- Theory and algorithms for application domains (R: 0.05; S: 0.63)
---- Theory of computation (R: 0.04; S: 0.66)
--- Machine learning approaches (R: 0.05; S: 0.54)
-- Machine learning algorithms (R: 0.06; S: 0.60)
- F: 0.66. R: 0.04. S: 0.23.
-- Natural language processing (R: 0.03; S: 0.21)
-- Artificial intelligence (R: 0.02; S: 0.30)


OK! Let's decode this:
- Mainstream is obviously *networks*, with two main groups
    - the design fields (*distributed architecture*, *organization*)
    - the implementation fields (*software*)
- Inside networks, but a little bit isolated, *search engine architectures and scalability* calls for the scalable property of P2P networks. The SE reference probably comes from Distributed Hash Tables, one of the main theoretical and practical success of P2P.

Now, let's look at the main keywords.

In [17]:
gismo.get_ranked_features()

['learning',
 'reinforcement',
 'reinforcement learning',
 'decision',
 'machine',
 'supervised learning',
 'supervised',
 'iteration',
 'learning learning',
 'machine learning']

Let's organize them.

In [18]:
gismo.get_clustered_ranked_features()

 F: 0.62. R: 0.01. S: 0.93.
- F: 0.84. R: 0.01. S: 0.92.
-- F: 0.87. R: 0.01. S: 0.92.
--- learning (R: 0.00; S: 0.96)
--- reinforcement (R: 0.00; S: 0.83)
--- reinforcement learning (R: 0.00; S: 0.83)
--- decision (R: 0.00; S: 0.96)
--- machine (R: 0.00; S: 0.95)
--- supervised learning (R: 0.00; S: 0.81)
--- supervised (R: 0.00; S: 0.81)
--- machine learning (R: 0.00; S: 0.93)
-- learning learning (R: 0.00; S: 0.75)
- iteration (R: 0.00; S: 0.68)


Rough analysis:
- One cluster about network protocols
- One cluster about architectures

## PageRank query

We perform the query *PageRank*. The returned ``False`` tells that *PageRank* is not a feature of the corpus (it's a small corpus after all, made only of catagory titles).

In [19]:
gismo.rank("Pagerank")

False

Let's try to avoid the copyright infrigment. Ok, now it works.

In [20]:
gismo.rank("ranking the web")

True

What are the best ACM categories for an article on *PageRank*?

In [21]:
gismo.get_ranked_documents()

['Web searching and information discovery',
 'World Wide Web',
 'Information systems',
 'Web applications',
 'Supervised learning',
 'Retrieval models and ranking',
 'Learning paradigms',
 'Information retrieval',
 'Machine learning',
 'Web mining']

Sounds nice. How are these domains related in the context of *PageRank*?

In [22]:
gismo.get_clustered_ranked_documents()

 F: 0.22. R: 0.43. S: 0.78.
- F: 0.22. R: 0.37. S: 0.75.
-- Web searching and information discovery (R: 0.08; S: 0.63)
-- F: 0.91. R: 0.13. S: 0.81.
--- World Wide Web (R: 0.08; S: 0.79)
--- Information systems (R: 0.05; S: 0.85)
-- F: 0.92. R: 0.09. S: 0.41.
--- Supervised learning (R: 0.04; S: 0.40)
--- Learning paradigms (R: 0.03; S: 0.43)
--- Machine learning (R: 0.02; S: 0.43)
-- F: 0.81. R: 0.07. S: 0.36.
--- Retrieval models and ranking (R: 0.04; S: 0.33)
--- Information retrieval (R: 0.03; S: 0.44)
- F: 0.38. R: 0.07. S: 0.46.
-- Web applications (R: 0.05; S: 0.49)
-- Web mining (R: 0.02; S: 0.34)


Hum, maybe somethin more compact. Let's lower the resolution (default resolution is 0.9).

In [23]:
gismo.get_clustered_ranked_documents(resolution=.8)

 F: 0.20. R: 0.43. S: 0.78.
- F: 0.75. R: 0.21. S: 0.70.
-- Web searching and information discovery (R: 0.08; S: 0.63)
-- World Wide Web (R: 0.08; S: 0.79)
-- Information systems (R: 0.05; S: 0.85)
- Web applications (R: 0.05; S: 0.49)
- F: 0.92. R: 0.09. S: 0.41.
-- Supervised learning (R: 0.04; S: 0.40)
-- Learning paradigms (R: 0.03; S: 0.43)
-- Machine learning (R: 0.02; S: 0.43)
- F: 0.81. R: 0.07. S: 0.36.
-- Retrieval models and ranking (R: 0.04; S: 0.33)
-- Information retrieval (R: 0.03; S: 0.44)
- Web mining (R: 0.02; S: 0.34)


Better! Let's broadly decode this:
- One cluster of categories is about the Web & Search
- One cluster is about learning techniques
- One cluster is about information retrieval.

Now, let's look at the main keywords.

In [24]:
gismo.get_ranked_features()

['web',
 'ranking',
 'social',
 'learning',
 'discovery',
 'supervised',
 'supervised learning',
 'security',
 'site',
 'learning by']

Let's organize them.

In [25]:
gismo.get_clustered_ranked_features()

 F: 0.02. R: 0.01. S: 0.89.
- F: 0.09. R: 0.01. S: 0.89.
-- F: 0.87. R: 0.01. S: 0.85.
--- web (R: 0.00; S: 0.87)
--- ranking (R: 0.00; S: 0.91)
--- social (R: 0.00; S: 0.84)
--- discovery (R: 0.00; S: 0.80)
--- site (R: 0.00; S: 0.77)
-- F: 0.94. R: 0.00. S: 0.36.
--- learning (R: 0.00; S: 0.44)
--- supervised (R: 0.00; S: 0.35)
--- supervised learning (R: 0.00; S: 0.35)
--- learning by (R: 0.00; S: 0.35)
- security (R: 0.00; S: 0.14)


Rough analysis:
- One cluster about the Web
- One cluster about learning
- One lone wolf: security