In [3]:
import os,re
import time,pickle
from tqdm import *
from os.path import expanduser
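import numpy as np
import pandas as pd  # used below to display the results as tables
import nltk  # used below by the Tokenizer class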
import sklearn
import matplotlib.pylab as plt
%matplotlib inline
%load_ext autoreload
%autoreload 2
home = expanduser('~')
os.chdir(os.path.join(home,'Documents','project','agu_data','repo','agu_data'))
from Data_Utils import *

import plotly.plotly as py
from plotly.tools import FigureFactory as FF

import gensim
from gensim import corpora, models, similarities
from pprint import pprint
model_saved = os.path.join(home,'Documents','project','agu_data','repo','Notebook','Models')
abstractf = os.path.join(model_saved,'gensim','abstract','abstract')



The American Geophysical Union (AGU) Fall Meeting is a geoscience conference
held each year around Christmas in San Francisco. It represents a
great opportunity for PhD students like me to show off their work and
enjoy what the West Coast has to offer. However, with nearly 24,000 attendees, the AGU Fall Meeting is also the
largest Earth and space science meeting in the world. As such, it
represents an interesting data set to dive into the geoscience academic
world.

In this post, I apply different information retrieval techniques from the field of natural language processing to uncover hidden patterns in the collection of abstracts submitted in 2015.

The objective is twofold:

- Identify semantic similarities between the contributions presented at AGU to build a recommendation system based on the abstract content.
- Propose, for each contributor, a list of potential collaborators based on the authors of the papers returned by our recommendation system.

Different natural language processing tools are available in Python to achieve this goal and, after trying [sklearn](http://scikit-learn.org/stable/), I decided to settle on [gensim](https://radimrehurek.com/gensim/), which has particularly fast and effective implementations for working with large datasets (~20000 abstracts here).

The basic stages, which I detail below, are:

- Clean the data.
- Construct a valid embedding for the corpus.
- Compute the similarities between the documents within this embedding.

## Data cleaning

Data cleaning is an essential step for our recommendation system. Indeed, our model is going to use the resulting corpus to build a consistent embedding of the abstracts and we don't want it to focus on unnecessary details. In particular, I used the module **unicodedata** to remove non-ASCII characters from the corpus.
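
The cleaning helpers themselves live in `Data_Utils` and are not shown here. As a minimal sketch of the idea, assuming the cleaning boils down to a Unicode normalization followed by an ASCII encoding, it could look like this (`to_ascii` is a hypothetical name, not the function actually used):

```python
import unicodedata

def to_ascii(text):
    """Replace accented characters by their closest ASCII letter and drop the rest."""
    # NFKD splits an accented character into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' drops whatever is not plain ASCII
    return decomposed.encode('ascii', 'ignore').decode('ascii')

cleaned = to_ascii(u'Cl\u00e9ment studies the \u03b4-function')
```

Accented letters keep their base letter, while characters without an ASCII counterpart (the Greek delta here) simply disappear.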

In [4]:
data = get_all_data('agu2015')
sources = [df for df in data if (''.join(df.title) != "") and (df.abstract != '') and (len(df.abstract.split(' '))>100)]
sections = [df.section for df in sources]
abstracts = get_clean_abstracts(sources)
titles = get_clean_titles(sources)



In the following, I'll use one of my contributions to evaluate the consistency of our recommendation system.

In [5]:
def name_to_idx(sources,name):
    ''' From an author name, return the indices of his/her contributions '''
    return [i for i, f in enumerate(sources) if name in f.authors.keys()]
    
my_contrib = name_to_idx(sources,'Clement Thorey')
print 'Title : %s'%(titles[my_contrib[0]])
print 'Abstract : %s'%(abstracts[my_contrib[0]])+'\n\n'

Title :  Floor-Fractured Craters through Machine Learning Methods
Abstract : Floor-fractured craters are impact craters that have undergone post impact deformations. They are characterized by shallow floors with a plate-like or convex appearance, wide floor moats, and radial, concentric, and polygonal floor-fractures. While the origin of these deformations has long been debated, it is now generally accepted that they are the result of the emplacement of shallow magmatic intrusions below their floor. These craters thus constitute an efficient tool to probe the importance of intrusive magmatism from the lunar surface. The most recent catalog of lunar-floor fractured craters references about 200 of them, mainly located around the lunar maria Herein, we will discuss the possibility of using machine learning algorithms to try to detect new floor-fractured craters on the Moon among the 60000 craters referenced in the most recent catalogs. In particular, we will use the gravity field provided

Maybe a bit of context is useful here. My PhD was about the detection and characterization of magmatic intrusions on terrestrial planets. For those who wonder, a magmatic intrusion is a large volume of magma which, instead of rising to the surface and forming a volcano, is emplaced at shallow depth (less than a few km beneath the surface), where it cools and solidifies. On Earth, erosion and weathering can sometimes expose these intrusions at the surface. This is the case, for instance, in the Henry Mountains.

![Example of an exposed magmatic intrusion in the Henry Mountains](https://upload.wikimedia.org/wikipedia/commons/a/a6/Laccolith_Montana.jpg)

My contributions at AGU deal with the detection and the characterization of those intrusions.

The first one is about the detection of a specific family of magmatic intrusions on the Moon, which we call crater-centered intrusions. These are magmatic intrusions that were emplaced and solidified beneath large impact craters (>20 km in diameter) at the surface of the Moon. As a result, these craters are heavily deformed by the intrusion, with large networks of fractures crossing their floors. In this contribution, I use machine learning techniques to try to automatically detect potential floor-fractured craters among 60000 referenced lunar impact craters.


## Bag of Words model

The basic representation for a corpus of text documents is called a [Bag of Words (BoW) model](https://en.wikipedia.org/wiki/Bag-of-words_model). This model looks at all the words in the corpus and first builds a dictionary referencing all the words it has seen. Then, for each document in the corpus, it simply counts how many times each word of the dictionary appears in this particular document. The result is a large matrix, where each row is a text document and each column is a particular word of the dictionary, that is, as you can guess, mostly filled with zeros.
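
To make this concrete, here is a toy sketch in plain Python on two made-up sentences. It is only an illustration of the counting idea, not the gensim implementation used in the rest of this post:

```python
from collections import Counter

docs = ["the crater floor is fractured",
        "the magma intrusion lifts the crater floor"]

# Toy tokenizer: split on white space
tokenized = [doc.split(' ') for doc in docs]

# Dictionary: one integer id per distinct token, in order of first appearance
token2id = {}
for tokens in tokenized:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

# Dense BoW matrix: one row per document, one column per dictionary token
vocabulary = sorted(token2id, key=token2id.get)
bow_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_matrix.append([counts[token] for token in vocabulary])
```

Even with two short documents, several entries of `bow_matrix` are already zero; with thousands of abstracts and tokens, almost all of them are.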

### Tokenizer

Under the hood, the BoW model assumes an efficient tokenizer function which is able to split each document into its own set of tokens. A vanilla tokenizer function looks like this

In [23]:
def tokenizer(text):
    return text.split(' ')

which simply looks at each document and splits it into a list of tokens according to the white spaces in the document. In the following, I'll use a slightly more evolved version of this tokenizer, which I embedded in a `Tokenizer` class.

It uses the [nltk](http://www.nltk.org/) library to first break each document (abstract) into sentences, then words.
Then, using regular expressions, it keeps only suitable tokens. In particular,

- `^[a-z]+$` keeps only words made of letters.
- `^[a-z][\d]$` selects tokens that have 2 characters, one letter, one number (molecule stuff).
- `^[a-z][\d][a-z]$` selects tokens that have 3 characters, one letter, one number, one letter (again molecule stuff).
- `^[a-z]{3}[a-z]*-[a-z]*$` includes some tokens that are composed of two words joined by -.
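
As a quick check of what these four patterns let through, here is a small sketch on a hand-picked list of lower-cased candidate tokens (the list itself is made up for the example):

```python
import re

# The four patterns listed above, joined with "|"
TOKEN_PATTERN = re.compile(
    r'^[a-z]+$|^[a-z][\d]$|^[a-z]\d[a-z]$|^[a-z]{3}[a-z]*-[a-z]*$')

candidates = ['crater', 'o2', 'h2o', 'floor-fractured',
              '2015', 'co2', '60000', 'x-ray', '(', 'moon.']
kept = [tok for tok in candidates if TOKEN_PATTERN.search(tok)]
```

Here `kept` contains 'crater', 'o2', 'h2o' and 'floor-fractured'. Note that 'co2' and 'x-ray' are rejected: the first has two letters before the digit, and the second has fewer than three letters before the hyphen.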

Next, I use a stopword list provided by **nltk** to filter out all the common words of the English language. Indeed, stopwords are words like 'the' or 'as' that are most likely present everywhere but do not carry meaningful information for our purpose. This tokenizer also incorporates a last stage of stemming for each token. Stemming is the term used in information retrieval to describe the process of reducing words to their word stem, base or root form, generally a written word form.

For instance, imagine this document

'Here we show that running is good for health. Indeed runner are quite healthy. Though they have runned a lot in their runly life, they are quite good at that.'

Clearly, this document is all about running! Nevertheless, without the stemming part in our tokenizer, 'running', 'runned' and 'runly' each get a weight of only 1, less than 'good', which appears twice. In contrast, stemming ideally reduces 'running', 'runned', 'runly' and 'runner' to their stem, namely 'run' (a real stemmer may leave some of these variants untouched). The word 'run' in the BoW then has a weight of up to 4 for this document, clearly underlining its importance! I use the so-called SnowballStemmer included in the **nltk** library for stemming.

Finally, in addition to these simple stem tokens, I also add the possibility of including bi-grams in the dictionary, i.e. all the combinations of two consecutive stems in each abstract, which is a common practice when using a BoW model. We will see why later.
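
For illustration, here is what adding bi-grams does to a short list of stems. This is a stand-alone sketch with made-up stems; the `Tokenizer` class below implements the same idea with a generator:

```python
def bigrams(tokens):
    """Join every pair of consecutive tokens with an underscore."""
    return [first + '_' + second for first, second in zip(tokens, tokens[1:])]

stems = ['floor', 'fractur', 'crater', 'moon']
tokens_with_bigrams = stems + bigrams(stems)
```

The four stems are kept and three new tokens, 'floor_fractur', 'fractur_crater' and 'crater_moon', are appended to them.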


In [238]:
class Tokenizer(object):

    def __init__(self, add_bigram):
        self.add_bigram = add_bigram
        self.stopwords = nltk.corpus.stopwords.words('english')
        self.stemmer = nltk.stem.snowball.SnowballStemmer("english")

    def bigram(self, tokens):
        if len(tokens) > 1:
            for i in range(0, len(tokens) - 1):
                yield tokens[i] + '_' + tokens[i + 1]

    def tokenize_and_stem(self, text):
        tokens = [word.lower() for sent in nltk.sent_tokenize(text)
                  for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        bad_tokens = []
        # keep only the tokens matching one of the patterns above (this drops
        # numeric tokens and raw punctuation)
        for token in tokens:
            if re.search(r'(^[a-z]+$|^[a-z][\d]$|^[a-z]\d[a-z]$|^[a-z]{3}[a-z]*-[a-z]*$)', token):
                filtered_tokens.append(token)
            else:
                bad_tokens.append(token)
        # remove the stopwords, then stem what remains
        filtered_tokens = [
            token for token in filtered_tokens if token not in self.stopwords]
        stems = [self.stemmer.stem(token) for token in filtered_tokens]
        if self.add_bigram:
            # bigram() is a generator, hence the explicit list()
            stems += list(self.bigram(stems))
        return [str(stem) for stem in stems]


### Dictionary

The next step is to build the dictionary. **Gensim** is built in a memory-friendly fashion. Therefore, instead of loading the whole corpus into memory, tokenizing and stemming everything and seeing what remains, it allows us to build the dictionary document by document, with one document in memory at a time.

In [7]:
abstractf = os.path.join(model_saved,'gensim','abstract','abstract')
build = False
# From now on, tokenizer refers to an instance of the Tokenizer class
tokenizer = Tokenizer(False)
if build:
    # First, write the corpus to a txt file, one document per line
    # (same file name as the one read by MyCorpus below).
    write_clean_corpus(abstracts,abstractf+'_data.txt')
    # Next, create the dictionary by iterating over the abstracts, one per line in the txt file
    dictionary = corpora.Dictionary(tokenizer.tokenize_and_stem(line) for line in open(abstractf+'_data.txt'))
    dictionary.save(abstractf+'_raw.dict')
else:
    dictionary = corpora.Dictionary.load(abstractf+'_raw.dict')

The resulting dictionary contains 70150 tokens. While we could work out a BoW model from there, it is often a good idea to remove extreme tokens. For instance, a token appearing in only 1 abstract is not going to help us build a recommendation system. Similarly, a token that appears in all the documents is not likely to carry meaningful information for our purpose either. I therefore decided to remove all tokens that appear in fewer than 5 abstracts or in more than 80% of them. Note that creating the dictionary can take up to 1 minute on my laptop, which makes serialization a good idea.
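
The filtering rule itself is simple. As a sketch on a made-up corpus of four tokenized documents (a plain-Python illustration of the document-frequency rule, not gensim's code; the function name simply echoes the gensim method used below):

```python
from collections import Counter

def filter_extremes(tokenized_docs, no_below, no_above):
    """Keep tokens present in at least `no_below` documents and in at most
    a fraction `no_above` of all documents."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each token at least once
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {token for token, df in doc_freq.items()
            if df >= no_below and df <= no_above * n_docs}

docs = [['crater', 'moon', 'studi'],
        ['crater', 'mar', 'studi'],
        ['fault', 'slip', 'studi'],
        ['fault', 'crater', 'studi']]
kept = filter_extremes(docs, no_below=2, no_above=0.8)
```

Only 'crater' and 'fault' survive: 'studi' is in every document, and the other tokens appear in a single one.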

In [8]:
build = False
if not os.path.isfile(abstractf+'.dict') or build:
    # Filter the raw dictionary and cache the result
    dictionary =  corpora.Dictionary.load(abstractf+'_raw.dict')
    dictionary.filter_extremes(no_below=5,no_above=0.80,keep_n=200000)
    dictionary.save(abstractf+'.dict')
else:
    dictionary = corpora.Dictionary.load(abstractf+'.dict')
# Reverse mapping, from integer id to token
dictionary.id2token = {k:v for v,k in dictionary.token2id.items()}

### BoW representation

Now that we have the dictionary, it is actually easy to obtain the BoW representation of any document. We just have to tokenize the document using the same function used to build the dictionary and count the occurrences of each word. Each dictionary in **gensim** has a method **doc2bow** which does exactly that and returns the representation as a sparse vector, i.e. a vector where only words that have a non-zero count are returned.

For instance, the BoW representation of my first abstract is 

In [311]:
my_contrib_bow = dictionary.doc2bow(tokenizer.tokenize_and_stem(abstracts[my_contrib[0]]))
df = [f+(dictionary.id2token[f[0]],) for f in my_contrib_bow]
df = pd.DataFrame(df,columns = ['id','Count','Token']).sort_values(by='Count',ascending = False)
df.index= range(len(df))
table = FF.create_table(df.head(5))
py.iplot(table, filename='Bow_0')

where the results are presented as a pandas dataframe for clarity and each id has been mapped to its token. Indeed, the dictionary assigns a unique integer id to every token it contains. Note that the BoW representation of my first abstract on floor-fractured craters, which highlights the importance of the stems crater, lunar, intrusion, floor and classifi, is fairly accurate.

By converting each abstract of the corpus with this doc2bow method, we can obtain the BoW representation of our full corpus. A memory-careless way to do that is to just iterate the doc2bow method of our dictionary over the abstract list we defined at the beginning. Nevertheless, this would end up storing the whole doc2bow representation in memory as a huge matrix. Instead, **gensim** only requires that a corpus be able to return one document vector (here, the doc2bow representation of the document) at a time. We therefore define the BoW corpus as a specific object `MyCorpus`, whose `__iter__` method iterates over and transforms each line of the txt file where the abstracts are stored.

In [9]:
class MyCorpus(Tokenizer):

    def __init__(self, name, add_bigram):
        super(MyCorpus, self).__init__(add_bigram)
        self.name = name

    def load_dict(self):
        if not os.path.isfile(self.name + '.dict'):
            print 'You should build the dictionary first !'
        else:
            setattr(self, 'dictionary',
                    corpora.Dictionary.load(self.name + '.dict'))

    def __iter__(self):
        for line in open(self.name + '_data.txt'):
            # assume there's one document per line, tokens separated by
            # whitespace
            yield self.dictionary.doc2bow(self.tokenize_and_stem(line))

    def __str__(self, n):
        for i, line in enumerate(open(self.name + '_data.txt')):
            print line
            if i > n:
                break

In [10]:
build = False
if build:
    bow_corpus = MyCorpus(abstractf,False)
    corpora.MmCorpus.serialize(abstractf+'_bow.mm',bow_corpus)
else:
    bow_corpus = corpora.MmCorpus(abstractf+'_bow.mm')
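
The streaming idea does not depend on gensim. As a stand-alone sketch (the class name `StreamedBow`, the toy dictionary and the temporary file are all made up for the example), a corpus only needs an `__iter__` method that reads the file line by line:

```python
import os
import tempfile
from collections import Counter

class StreamedBow(object):
    """Yield one sparse BoW vector per line of a text file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        with open(self.path) as handle:
            for line in handle:
                counts = Counter(tok for tok in line.split() if tok in self.token2id)
                # Sparse vector: sorted (token id, count) pairs, zeros are left out
                yield sorted((self.token2id[tok], n) for tok, n in counts.items())

# Write a two-document toy corpus to a temporary file, one document per line
tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False)
tmp.write('crater moon crater\nfault slip\n')
tmp.close()

corpus = StreamedBow(tmp.name, {'crater': 0, 'moon': 1, 'fault': 2})
vectors = list(corpus)
os.remove(tmp.name)
```

Only one line is held in memory at a time, and tokens missing from the dictionary ('slip' here) are silently ignored, which is also what happens to the tokens removed when filtering the extremes.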

`MyCorpus` also defines a printing method which displays the first n documents. The serialized corpus loaded here reports its overall size instead. Again, the representation is sparse, i.e. it only stores the counts of the non-zero elements

In [11]:
bow_corpus.__str__()

'MmCorpus(21935 documents, 14669 features, 2191053 non-zero entries)'

### Recommendation 

In the BoW representation of our corpus, each abstract is a point in a high-dimensional embedding (14669 dimensions, to be exact). The *distance* or the *similarity* between one abstract and the rest of the corpus, according to some metric, can then be used to compare different contributions and to provide a recommendation list for a specific query.

The Euclidean distance is the most natural choice for the similarity measure. Given two vectors $\vec{a}$ and $\vec{b}$, it is equal to
$$d(\vec{a},\vec{b}) = \sqrt{(\vec{b}- \vec{a})\cdot(\vec{b}- \vec{a}) }$$


However, we'd like our measure to be independent of the magnitude of the vectors. For instance, we'd like to identify as similar two abstracts which contain exactly the same tokens, even if their occurrence counts differ significantly. The Euclidean distance clearly does not have this property.

Accordingly, a more reliable measure for our purpose is called "the cosine similarity". For two vectors, $\vec{a}$ and $\vec{b}$, the cosine similarity $d$ is defined as :

$$ d(\vec{a},\vec{b})= \frac{\vec{a} \cdot \vec{b}}{|\vec{a}||\vec{b}|} = \cos(\vec{a},\vec{b})$$

In particular, this similarity measure is the dot product of the two normalized vectors and hence depends only on the angle between them (which is where its name comes from ;). It ranges from -1, when two vectors point in opposite directions, to 1, when they point in the same direction. Since BoW counts are never negative, it stays between 0 and 1 here.
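
A small numerical sketch makes the difference between the two measures explicit. The three vectors are made-up BoW counts over a four-token dictionary:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_abstract = [1, 2, 0, 1]
long_abstract = [3, 6, 0, 3]     # same tokens, three times the counts
other_abstract = [0, 0, 4, 1]    # mostly different tokens

same_direction = cosine(short_abstract, long_abstract)
different_topic = cosine(short_abstract, other_abstract)
far_apart = euclidean(short_abstract, long_abstract)
```

The first two abstracts have a cosine similarity of 1 (up to rounding) although their Euclidean distance is about 4.9, while the third one, which shares a single token with the first, only reaches about 0.1.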

To compute the similarity of one query against our BoW representation, the natural procedure is to first transform our sparse representation into its dense equivalent, i.e. a matrix where the number of rows corresponds to the number of tokens in the dictionary and the number of columns to the number of abstracts in the corpus. Then, we normalize each column of the matrix such that each document corresponds to a unit vector in the representation space. Finally, we take the dot product of the transposed matrix with the normalized query to get the cosine similarity against all documents in the corpus.

**Gensim** contains efficient utility functions to help convert from/to numpy matrices and therefore, this translates to

In [12]:
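def recom(abstractf,name):
    ''' Rank all the abstracts by cosine similarity to my first contribution '''
    corpus = corpora.MmCorpus(abstractf + '_'+str(name)+'.mm')
    indexf = abstractf+'_'+str(name)+'.index'
    if not os.path.isfile(indexf):
        # Build the dense, normalized similarity index once and cache it
        index = similarities.MatrixSimilarity(corpus, num_features=corpus.num_terms)
        index.save(indexf)
    else:
        index = similarities.MatrixSimilarity.load(indexf)
    score = index[corpus[my_contrib[0]]]
    order = np.argsort(score)[::-1]
    results = pd.DataFrame({'CosineSimilarity': score[order],
                            'Title': np.array(titles)[order]},
                           columns = ['CosineSimilarity','Title'])
    return results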

df = recom(abstractf,'bow')
for i,row in df.iterrows():
    print 'Recom %d - Cosine: %1.3f - Title: %s'%(i+1,float(row.CosineSimilarity),row.Title)
    if i>8:
        break

Recom 1 - Cosine: 1.000 - Title:  Floor-Fractured Craters through Machine Learning Methods
Recom 2 - Cosine: 0.465 - Title:  Preliminary Geological Map of the Ac-H-2 Coniraya Quadrangle of Ceres  An Integrated Mapping Study Using Dawn Spacecraft Data
Recom 3 - Cosine: 0.459 - Title:  The collisional history of dwarf planet Ceres revealed by Dawn
Recom 4 - Cosine: 0.453 - Title:  Structural and Geological Interpretation of Posidonius Crater on the Moon
Recom 5 - Cosine: 0.433 - Title:  Initial Results from a Global Database of Mercurian Craters
Recom 6 - Cosine: 0.431 - Title:  Lunar Crater Interiors with High Circular Polarization Signatures
Recom 7 - Cosine: 0.427 - Title:  Morphologic Analysis of Lunar Craters in the Simple-to-Complex Transition
Recom 8 - Cosine: 0.419 - Title:  Katabatically Driven Downslope Windstorm-Type Flows on the Inner Sidewall of Arizona's Barringer Meteorite Crater
Recom 9 - Cosine: 0.393 - Title:  Origin of the rock abundance anomaly at Tsiolkovskiy crater


## TF-IDF representation

One of the problems with the BoW representation is that it often puts too much weight on common words of the corpus. Indeed, while we removed the most common words in English, i.e. the stopwords, words like 'present', 'show' or whatever else is commonly used when writing abstracts can add some noise with regard to our purpose. In particular here, we would like to put more weight on the tokens that make each abstract specific.

A common way to do this is to use a **Tf-Idf** normalization to re-weight each count in the BoW representation by the inverse of the token frequency in the whole corpus. **Tf** means term-frequency while **Tf–Idf** means term-frequency times inverse document-frequency. This way, the weight of tokens that are common across the corpus is significantly lowered.

This implementation is available in **gensim** and can be easily combined with the BoW representation to get the representation of the corpus in the tf-idf space.
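
To see the effect of the re-weighting, here is a plain-Python sketch on a made-up sparse BoW corpus. It uses one common variant, count times log2(N / document frequency) followed by a normalization to unit length, which is assumed here to be close to what `TfidfModel` does by default:

```python
import math
from collections import Counter

def tfidf(bow_corpus):
    """Re-weight sparse BoW vectors by count * log2(N / document frequency),
    then scale each document to unit length."""
    n_docs = len(bow_corpus)
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    weighted = []
    for doc in bow_corpus:
        vec = [(tid, count * math.log(float(n_docs) / doc_freq[tid], 2))
               for tid, count in doc]
        vec = [(tid, w) for tid, w in vec if w > 0]   # drop zero weights
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec])
    return weighted

# Token 0 appears in every document, tokens 1 and 2 in a single one
bow_corpus = [[(0, 3), (1, 1)],
              [(0, 2), (2, 2)],
              [(0, 1)]]
tfidf_corpus = tfidf(bow_corpus)
```

Token 0 has the largest raw counts, but since it appears in every document its weight drops to zero, and each of the first two documents ends up described by its specific token only.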


In [13]:
build = False
if not os.path.isfile(abstractf+'_tfidf.mm') or build:
    # First load the corpus and the dicitonary
    bow_corpus = corpora.MmCorpus(abstractf+'_bow.mm')
    dictionary = corpora.Dictionary.load(abstractf+'.dict')
    # Initialize the tf-idf model
    tfidf = models.TfidfModel(bow_corpus)
    # Compute the tfidf of the corpus itself
    tfidf_corpus = tfidf[bow_corpus]
    # Serialize both for reuse
    tfidf.save(abstractf+'_tfidf.model')
    corpora.MmCorpus.serialize(abstractf+'_tfidf.mm',tfidf_corpus)
else:
    tfidf = models.TfidfModel.load(abstractf+'_tfidf.model')
    tfidf_corpus = corpora.MmCorpus(abstractf+'_tfidf.mm')

In [14]:
df = recom(abstractf,'tfidf')
for i,row in df.iterrows():
    print 'Recom %d - Cosine: %1.3f - Title: %s'%(i+1,float(row.CosineSimilarity),row.Title)
    if i>8:
        break

Recom 1 - Cosine: 1.000 - Title:  Floor-Fractured Craters through Machine Learning Methods
Recom 2 - Cosine: 0.488 - Title:  Structural and Geological Interpretation of Posidonius Crater on the Moon
Recom 3 - Cosine: 0.475 - Title:  The collisional history of dwarf planet Ceres revealed by Dawn
Recom 4 - Cosine: 0.462 - Title:  Katabatically Driven Downslope Windstorm-Type Flows on the Inner Sidewall of Arizona's Barringer Meteorite Crater
Recom 5 - Cosine: 0.450 - Title:  Initial Results from a Global Database of Mercurian Craters
Recom 6 - Cosine: 0.447 - Title:  Hydrological Evolution and Chemical Structure of the Hyper-acidic Spring-lake System on White Island, New Zealand
Recom 7 - Cosine: 0.444 - Title:  Lunar Crater Interiors with High Circular Polarization Signatures
Recom 8 - Cosine: 0.437 - Title:  Preliminary Geological Map of the Ac-H-2 Coniraya Quadrangle of Ceres  An Integrated Mapping Study Using Dawn Spacecraft Data
Recom 9 - Cosine: 0.424 - Title:  Morphologic Analysis

This indeed slightly improves the scores.

## Latent Semantic Analysis (LSA) or Indexing (LSI)

Latent Semantic Analysis (LSA), or Latent Semantic Indexing (LSI), is a common method in information retrieval to reduce the dimension of the representation space. The idea behind it is that a lot of the dimensions in the BoW model or, equivalently, the Tf-idf model are redundant. For instance, the words machine and learning are likely to occur together. Therefore, merging these two dimensions into a single one, formed by a linear combination of the tokens machine and learning, would reduce the dimension with little loss of information. More generally, Latent Semantic Analysis aims to reduce the number of dimensions while keeping as much as possible of the information present in the higher-dimensional space, by identifying deep semantic patterns in the corpus.

To identify this semantic structure, Latent Semantic Analysis uses a linear algebra method called [Singular Value Decomposition (SVD)](https://en.wikipedia.org/wiki/Latent_semantic_analysis). Formally, the SVD is able to identify a consistent lower-dimensional approximation of the higher-dimensional tfidf space. **Gensim** implements Latent Semantic Analysis in a model called `LsiModel`, which can easily be used on top of our previous representation.
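
As a sketch of what the truncated SVD does (the notation is introduced here for illustration and is not gensim's): writing $X$ for the tf-idf matrix, with one row per token and one column per abstract, and keeping only the $k$ largest singular values,

$$ X \approx X_k = U_k \Sigma_k V_k^{T} $$

where the columns of $U_k$ are the $k$ latent directions ("topics") in token space, $\Sigma_k$ is the diagonal matrix of the $k$ largest singular values and each row of $V_k$ describes one abstract. One common convention then represents an abstract $\vec{d}$ by the $k$-dimensional vector $U_k^{T}\vec{d}$. With `num_topics=500`, as in the cell below, $k = 500$.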

In [15]:
build = False
if not os.path.isfile(abstractf+'_lsi.mm') or build:
    # First load the corpus and the dicitonary
    tfidf_corpus = corpora.MmCorpus(abstractf+'_tfidf.mm')
    dictionary = corpora.Dictionary.load(abstractf+'.dict')
    # Initialize the LSI model
    lsi = models.LsiModel(tfidf_corpus,id2word=dictionary, num_topics=500)
    # Project the tf-idf corpus into the LSI space
    lsi_corpus = lsi[tfidf_corpus]
    # Serialize both for reuse
    lsi.save(abstractf+'_lsi.model')
    corpora.MmCorpus.serialize(abstractf+'_lsi.mm',lsi_corpus)
else:
    lsi = models.LsiModel.load(abstractf+'_lsi.model')
    lsi_corpus = corpora.MmCorpus(abstractf+'_lsi.mm')

In [16]:
df = recom(abstractf,'lsi')
for i,row in df.iterrows():
    print 'Recom %d - Cosine: %1.3f - Title: %s'%(i+1,float(row.CosineSimilarity),row.Title)
    if i>8:
        break

Recom 1 - Cosine: 1.000 - Title:  Floor-Fractured Craters through Machine Learning Methods
Recom 2 - Cosine: 0.827 - Title:  Lunar Crater Interiors with High Circular Polarization Signatures
Recom 3 - Cosine: 0.822 - Title:  Morphologic Analysis of Lunar Craters in the Simple-to-Complex Transition
Recom 4 - Cosine: 0.815 - Title:  Continuous Bombardment  Effect of Small Primary and Secondary Impacts on the Lunar Regolith
Recom 5 - Cosine: 0.813 - Title:  Structural and Geological Interpretation of Posidonius Crater on the Moon
Recom 6 - Cosine: 0.780 - Title:  Comparing Radar and Optical Data Sets of Lunar Impact Crater Ejecta
Recom 7 - Cosine: 0.776 - Title:  Katabatically Driven Downslope Windstorm-Type Flows on the Inner Sidewall of Arizona's Barringer Meteorite Crater
Recom 8 - Cosine: 0.763 - Title:  The collisional history of dwarf planet Ceres revealed by Dawn
Recom 9 - Cosine: 0.759 - Title:  Preliminary Geological Map of the Ac-H-2 Coniraya Quadrangle of Ceres  An Integrated M

In [141]:
lsi.show_topics(2)

[u'0.119*"model" + 0.116*"water" + 0.108*"climat" + 0.103*"soil" + 0.097*"chang" + 0.093*"data" + 0.086*"ice" + 0.083*"surfac" + 0.079*"temperatur" + 0.077*"region"',
 u'-0.333*"fault" + -0.260*"earthquak" + -0.213*"seismic" + 0.196*"soil" + -0.150*"slip" + 0.150*"climat" + -0.129*"mantl" + 0.125*"water" + -0.116*"veloc" + -0.109*"plate"']

## t-SNE

In [20]:
from sklearn.manifold import TSNE

In [156]:
lsi_corpus.length

In [None]:
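tsne = TSNE(n_components=2, random_state=0,metric='cosine')
# corpus2dense returns a (num_topics, num_documents) array, whereas
# t-SNE expects one row per document, hence the transpose
X = gensim.matutils.corpus2dense(lsi_corpus, num_terms=lsi.num_topics).T
X_tsne = tsne.fit_transform(X)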

In [155]:
tfidf_corpus

<gensim.corpora.mmcorpus.MmCorpus at 0x123a4ea90>

In [21]:
from sklearn import manifold
manifold.TSNE(n_components=2, init='pca', random_state=0)
lsi_corpus = corpora.MmCorpus(abstractf + '_lsi.mm')
X = gensim.matutils.corpus2dense(lsi_corpus, num_terms=lsi_corpus.num_terms)


In [327]:
x_data = np.asarray(X).astype('float64')

In [337]:
x_data.shape

(500, 21935)

In [338]:
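# bh_sne (Barnes-Hut t-SNE) comes from the tsne package; it expects one row per document
from tsne import bh_sne
vis_data = bh_sne(x_data.T)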

In [355]:
import seaborn as sns





In [367]:
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.tools import FigureFactory as FF 
plotly.offline.init_notebook_mode()  

In [369]:
#data
x_data = vis_data[:,0]
y_data = vis_data[:,1]
lab = sections_id

In [392]:
color = sns.color_palette('Set2',27)

In [402]:
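# One color per section (Set2 only has 8 colors, so they repeat),
# assuming sections_id holds one integer section id per abstract
palette = [c.upper() for c in sns.color_palette('Set2',27).as_hex()]

trace = go.Scattergl(
    x = x_data,
    y = y_data,
    text = sections,
    hoverinfo = 'text',
    mode = 'markers',
    marker = dict(
        opacity = 0.5,
        # one color per point, picked from its section id
        color = [palette[i] for i in sections_id]
    )
)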

layout = go.Layout(
    height = 700,
    width = 900,
    margin = {'b':30,'r':30,'l':30,'t':30},
    title='t-SNE embedding of the 2015 abstracts, colored by section',
    legend = {'yanchor':'auto',
              'x':.85,
             'font':{'size':15}})

data = [trace]
fig = go.Figure(data=data,layout=layout)
plotly.offline.iplot(fig, show_link=False)

In [334]:
# Map each section name to an integer id, then each abstract to its section id
dict_section = dict(zip(sorted(set(sections)),range(len(set(sections)))))
sections_id = [dict_section[x] for x in sections]

In [336]:
len(sections_id),len(vis_x)

(21935, 500)