In [1]:
import os,re
import time,pickle
from tqdm import *
from os.path import expanduser

# Introduction

The AGU conference is hold each year in San Francisco a few weeks before christmas.

With nearly 24,000 attendees, AGU Fall Meeting is the largest Earth and space science meeting in the world. As such, it is also a great dataset to study trends in the geoscience community. In the following, I use the papers contribution in the year 2015 and 2014 to find out some hidden structure in the abstracts.
The aim is two fold:

- For a given contribution, identify a list of similar contribution in the database
- For the contributions of a particular authors, identify a list of potential collaborators with names, adress and institute displayed on a map.

The method is greatly inspired from [Amir Amini](https://www.kaggle.com/amirhamini/d/benhamner/nips-2015-papers/find-similar-papers-knn/notebook) and [brandonmrose](http://brandonrose.org/clustering).

# Data preprocessing 

## Scrapping the AGU website 

The scripts used to scrap the [AGU wesbsite](https://fallmeeting.agu.org/2015/) as well as the resulting data are stored on this [repo](https://github.com/cthorey/agu_data) if you want to reproduce the following by yourself.

In [2]:
home = expanduser('~')
os.chdir(os.path.join(home,'Documents','repos','agu_data','agu_data'))
from Data_Utils import *

data = get_all_data('agu2015')
abstracts = [df.abstract for df in data if (df.title != [""]) and (df.abstract != '')]
titles = [' '.join(df.title) for df in data if (''.join(df.title) != "") and (df.abstract != '')]
sources = [df for df in data if (df.title != [""]) and (df.abstract != '')]



AGU abstract are short, $\sim 300$ words and looks like that

In [3]:
idx = 300
print '\n The title of this paper is : \n %s'%(titles[idx])
print '\n The corresponding abstract is :\n %s'%(abstracts[idx])


 The title of this paper is : 
  Slow Climate Velocities in Mountain Streams Impart Thermal Resistance to Cold-Water Refugia Across the West

 The corresponding abstract is :
 ePoster - PA31C-2165_JHPark_AGU.pdf
The purpose of the study is to find small greens’ disposition, types and sizes to reduce air temperature effectively in urban blocks. The research sites were six high developed blocks in Seoul, Korea. Air temperature was measured with mobile loggers in clear daytime during summer, from August to September, at screen level. Also the measurement repeated over three times a day during three days by walking and circulating around the experimental blocks and the control blocks at the same time. By analyzing spatial characteristics, the averaged air temperatures were classified with three spaces, sunny spaces, building-shaded spaces and small green spaces by using Kruskal-Wallis Test; and small green spaces in 6 blocks were classified into their outward forms, polygonal or linear an

We first remove extra text in the abstract (eposter) and (Invited) + convert unicode characters

In [5]:
import unicodedata
def clean_title(text):
    if text.split(' ')[-1] == '(Invited)':
        text = ' '.join(text.split(' ')[:-1])
    text = unicodedata.normalize('NFKD', text).encode('ascii','ignore')
    text = text.replace('\n', ' ')
    return text

def clean_abstract(text):
    if text.split('\n')[0].split(' ')[0] =='ePoster':
        text = ' '.join(text.split('\n')[1:])
    text = unicodedata.normalize('NFKD', text).encode('ascii','ignore')
    text = text.replace('\n', ' ')
    return text

titles = map(clean_title,titles)
abstracts = map(clean_abstract,abstracts)

In [6]:
idx = 300
print '\n The title of this paper is : \n %s'%(titles[idx])
print '\n The corresponding abstract is :\n %s'%(abstracts[idx])


 The title of this paper is : 
  Slow Climate Velocities in Mountain Streams Impart Thermal Resistance to Cold-Water Refugia Across the West

 The corresponding abstract is :
 The purpose of the study is to find small greens disposition, types and sizes to reduce air temperature effectively in urban blocks. The research sites were six high developed blocks in Seoul, Korea. Air temperature was measured with mobile loggers in clear daytime during summer, from August to September, at screen level. Also the measurement repeated over three times a day during three days by walking and circulating around the experimental blocks and the control blocks at the same time. By analyzing spatial characteristics, the averaged air temperatures were classified with three spaces, sunny spaces, building-shaded spaces and small green spaces by using Kruskal-Wallis Test; and small green spaces in 6 blocks were classified into their outward forms, polygonal or linear and single or mixed. The polygonal and 

## Tokenizing and stemming

The texts above cannot be fed directly into an algorithm to identify pattern within the corpus.
Instead, some form of vectorization have to be used.

A bag of words method is generally used in that purpose. In such a method, each document is first broken into unbreakable tokens and, in this method:

- each individual token occurrence frequency (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.

In the following, I build a function based on the library **nltk** to broke a documents into its most basic form, a list of stems.

To do so, 

- We first broke each documents in sentence and the words
- Remove all numeric, ponctuations with the module **re**
- Remove stowords from the tokens
- Stem each resulting tokens

In [7]:
import nltk

# Load SnowballStemmer as variabled 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token.lower())
    filtered_tokens = [token for token in filtered_tokens if token not in stopwords]
    stems = map(stemmer.stem,filtered_tokens)
    return map(str,stems)

# tf-idf representation 

Using the **tokenize_and_stem** function, we can then break each document in tokens and vectorize it. The vectorization will result in a matrix whom 

- each line is a text document of the corpus
- each column represent a token from the corpus
- a specific value represent the number of occurences of a particular token in a particular document

While **sklearn** provides a class **CountVectorizer** which can be used to form such a matrix, this representation often put to much weights on common words of the corpus, i.e. words like *the*, *a* in english. While this can be interesting in some situations, we would like to put more weight on words that make each abstract specific for our application.

One way to do this is to use a **Tf-idf** normalization to re-weiht each value in the matrix by the frequency of the token in the whole corpus. **Tf** means term-frequency while **tf–idf** means term-frequency times inverse document-frequency.

This normalization is implemented by the **TfidfTransformer** class of sklearn and is used in the following.

Main parameters:

- max_df : When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold
- min_df : When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold
- ngram_range : Range of n-gram to consider. Bi-gram and Three gram like floor fractured craters is preserved

## Small sub-routine

In [8]:
# Small routine to save the idf class and the resulting matrix.
model_saved = os.path.join(home,'Documents','repos','agu_data','Notebook','Models')
from sklearn.externals import joblib
def save_idf(folder,idf_vectorizer,idf_matrix):
    joblib.dump(idf_vectorizer,os.path.join(folder,'idf_vectorizer.pkl'))
    joblib.dump(idf_matrix,os.path.join(folder,'idf_matrix.pkl'))
    
def open_idf(folder):
    idf_vectorizer = joblib.load(os.path.join(folder,'idf_vectorizer.pkl'))
    idf_matrix = joblib.load(os.path.join(folder,'idf_matrix.pkl'))
    return idf_vectorizer, idf_matrix

## Titles 

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Set true if necessary to rebuild the model !
fit_transform = True
if fit_transform:
    titles_tfidf_vectorizer = TfidfVectorizer(analyzer = 'word',
                                   max_df=0.95, 
                                   max_features=200000, 
                                   min_df=0.001, 
                                   stop_words=stopwords,
                                   use_idf=True, 
                                   tokenizer=tokenize_and_stem,
                                   lowercase = True,
                                   ngram_range=(1,3))
    %time idf_titles = titles_tfidf_vectorizer.fit_transform(titles)
    folder_titles = os.path.join(model_saved,'tfidf','titles')
    save_idf(folder_titles,titles_tfidf_vectorizer,idf_titles)
else:
    folder_title = os.path.join(model_saved,'tfidf','titles')
    titles_tfidf_vectorizer, idf_titles = open_idf(folder_title)

CPU times: user 19.7 s, sys: 327 ms, total: 20 s
Wall time: 20.2 s


## Abstracts 

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Set true if necessary to rebuild the model !
fit_transform = True
if fit_transform:
    abstracts_tfidf_vectorizer = TfidfVectorizer(analyzer = 'word',
                                   max_df=0.9, 
                                   max_features=200000, 
                                   min_df=0.01, 
                                   stop_words=stopwords,
                                   use_idf=True, 
                                   tokenizer=tokenize_and_stem,
                                   lowercase = True,
                                   ngram_range=(1,3))
    %time idf_abstracts = abstracts_tfidf_vectorizer.fit_transform(abstracts)
    folder_abstracts = os.path.join(model_saved,'tfidf','abstracts')
    save_idf(folder_abstracts,abstracts_tfidf_vectorizer,idf_abstracts)
else:
    folder_abstracts = os.path.join(model_saved,'tfidf','abstracts')
    %time abstracts_tfidf_vectorizer, idf_abstracts = open_idf(folder_abstracts)

CPU times: user 5min 6s, sys: 5.96 s, total: 5min 11s
Wall time: 5min 17s


## Making sense of the tf-idf resulting arrays 

In [206]:
import pandas as pd

def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return 
        them with their corresponding feature names.'''
    
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

In [259]:
top_tfidf_feats(np.squeeze(idf_abstracts[0].toarray()),abstracts_tfidf_vectorizer.get_feature_names(),5)

Unnamed: 0,feature,tfidf
0,storm,0.601231
1,inner,0.352726
2,magnetospher,0.33749
3,two,0.173421
4,moder,0.166026


In [260]:
top_tfidf_feats(np.squeeze(idf_titles[0].toarray()),titles_tfidf_vectorizer.get_feature_names(),5)

Unnamed: 0,feature,tfidf
0,differ,0.53137
1,storm,0.528002
2,larg,0.519313
3,global,0.411307
4,zone observatori,0.0
