# Presentation of the Python Analysis Pipeline
This notebook presents the four steps of the analysis pipeline, from fetching blog posts from the database to analyzing their content: fetching data, text vectorization, SOM analysis, and clustering.

In [3]:
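from os import getcwd, chdir
from os.path import split
# When the notebook is started from the Notebooks folder, move up to the project
# root so that project-relative paths (e.g. data/scrape.sqlite) resolve.
currdir = split(getcwd())
if currdir[1] == "Notebooks":
    chdir(currdir[0])
print(getcwd())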
%load_ext autoreload
%autoreload 2

X:\husser\Code\Python\NeuralNetworks
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Fetching data 
The data to analyze is fetched from a sample SQLite database.

In [4]:
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from core.scraper import models
import numpy as np
import pandas as pd
DATABASE_CONNECT = "sqlite:///data/scrape.sqlite"
N_ARTICLES = 25

ImportError: No module named core.scraper

In [8]:
engine = create_engine(DATABASE_CONNECT)
DBSession = sessionmaker()
DBSession.configure(bind = engine)
session = DBSession()
corpus = session.query(models.WebArticle).limit(N_ARTICLES)
posts = np.array([[i.Title,i.Body] for i in corpus])

In [9]:
corpus = pd.DataFrame(posts, columns=["Title","Body"])
corpus.head()

|   | Title | Body |
|---|-------|------|
| 0 | Grèce : ce qui bloque encore | La Grèce et ses créanciers (Fonds monétaire in... |
| 1 | Auto à bas prix : l'histoire se poursuit | Concevoir une voiture bon marché n'a rien d'un... |
| 2 | En Espagne, Podemos inquiète les entreprises | Quatre années ont passé depuis le temps où les... |
| 3 | Vivendi et Orange lorgnent Telecom Italia | C’est une phrase prononcée comme ça, au détour... |
| 4 | L’oud envoûtant d’Anouar Brahem | Il a osé. Malgré les écueils et les doutes. Le... |
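
Since `core.scraper.models` is not shown in this notebook, the fetch step can be illustrated with a minimal stand-in that uses only the standard-library `sqlite3` module and an in-memory database. The table name `web_article` and the sample rows are assumptions made for illustration; only the `Title` and `Body` fields and the row limit come from the cells above.

```python
import sqlite3

N_ARTICLES = 2  # small limit for the toy example

# In-memory database standing in for data/scrape.sqlite (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_article (Title TEXT, Body TEXT)")
conn.executemany(
    "INSERT INTO web_article (Title, Body) VALUES (?, ?)",
    [
        ("First title", "First body text"),
        ("Second title", "Second body text"),
        ("Third title", "Third body text"),
    ],
)

# Roughly what session.query(...).limit(N_ARTICLES) does: fetch at most N rows.
rows = conn.execute(
    "SELECT Title, Body FROM web_article ORDER BY rowid LIMIT ?", (N_ARTICLES,)
).fetchall()
conn.close()

titles = [title for title, _ in rows]
bodies = [body for _, body in rows]
print(titles)
```

Keeping titles and bodies as plain Python lists avoids the fixed-width string dtype that `np.array` would pick for a list of text pairs.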


## 2. Text Vectorization
Text vectorization is performed with scikit-learn's `TfidfVectorizer`.
Future work: use the Hadoop ecosystem to parallelize this step.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [24]:
PARAMETERS_VECTORIZER = {
    "vocabulary": None,
    "ngram_range": (1, 3),
    "max_df": 0.8,
    "min_df": 3,
    "encoding": "utf-8",
    "strip_accents": "ascii",
    "norm": "l2",
}

In [25]:
vectorizer = TfidfVectorizer(**PARAMETERS_VECTORIZER)
vectorizer.fit_transform(posts[:,1])

<25x444 sparse matrix of type '<type 'numpy.float64'>'
	with 2166 stored elements in Compressed Sparse Row format>

In [43]:
tfidf = vectorizer.fit_transform(posts[:,1]).toarray()
print("No of articles:\t%d\nNo of features:\t%d"%tfidf.shape)

No of articles:	25
No of features:	444
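
To make the effect of `max_df`, `min_df`, `ngram_range` and `norm` more concrete, here is a small sketch on a made-up English corpus. The documents and the lowered `min_df=2` are assumptions chosen so that the filtering is visible with only four texts; the behavior on the real corpus will differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up); min_df is lowered to 2 because there are only 4 documents.
docs = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the mat",
    "a dog ate the bone",
]

toy_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.8, min_df=2, norm="l2")
toy_tfidf = toy_vectorizer.fit_transform(docs).toarray()
vocab = toy_vectorizer.vocabulary_  # dict: term -> column index

print(sorted(vocab))
print(toy_tfidf.shape)

# norm="l2" scales every non-empty row to unit Euclidean length.
row_norms = np.linalg.norm(toy_tfidf, axis=1)
print(row_norms)
```

In this sketch "the" occurs in every document and is dropped by `max_df=0.8`, while "fish" and "bone" occur only once and are dropped by `min_df=2`.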


To map feature indices back to terms, the vocabulary is also extracted.

In [44]:
# (term, feature index) pairs as a structured array; "a18" truncates terms to 18 bytes
vocabulary = np.array(list(vectorizer.vocabulary_.items()), dtype=("a18,i4"))
df = pd.DataFrame(vocabulary)
words = pd.Series(df.f0, index=pd.Index(df.f1))
words[50]

'scene'
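
A simpler way to build the index-to-term lookup is to invert the vocabulary dictionary directly. The sketch below uses a hypothetical dictionary shaped like `vectorizer.vocabulary_`; the terms and indices are made up. One point worth checking in the cell above: when `pd.Series` receives a Series together with an `index`, pandas aligns on the existing labels rather than relabeling by position, so the lookup may not return the term that belongs to a given feature index.

```python
# Hypothetical vocabulary shaped like vectorizer.vocabulary_ (term -> column index).
toy_vocabulary = {"scene": 2, "la montee": 0, "coalition": 1, "en france": 3}

# Invert the mapping; no fixed-width string dtype, so long n-grams are not truncated.
index_to_term = {index: term for term, index in toy_vocabulary.items()}

# Ordered list where position i holds the term of column i.
terms_by_column = [index_to_term[i] for i in range(len(index_to_term))]

print(index_to_term[2])
print(terms_by_column)
```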

## 3. SOM Analysis
The SOM mapper takes the TF-IDF matrix and maps the high-dimensional input space onto a 2D output space.

In [68]:
from core.som import som

In [69]:
PARAMETERS_SOM = {
    "kshape" : (10,10),
    "n_iter" : 150,
    "learning_rate" : 0.01,
    "initialization_func" : None,
    "topology" : "rect"}

In [70]:
mapper = som.SOMMapper(**PARAMETERS_SOM)

In [71]:
kohonen = mapper.fit_transform(tfidf)

In [72]:
kohonen

array([[ 0.07406778,  0.03076218,  0.00454896, ...,  0.03582369,
         0.02660959,  0.00959145],
       [ 0.07015325,  0.03120137,  0.00285738, ...,  0.02885604,
         0.02197417,  0.00903481],
       [ 0.05724074,  0.03618747,  0.00675572, ...,  0.02711527,
         0.02901932,  0.01200897],
       ..., 
       [ 0.00697336,  0.02343737,  0.01063956, ...,  0.01031934,
         0.00173255,  0.00390416],
       [ 0.0115377 ,  0.02660159,  0.01152986, ...,  0.01037489,
         0.00312459,  0.00322342],
       [ 0.01176909,  0.03013724,  0.01148132, ...,  0.00982748,
         0.00850772,  0.00566853]])
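
The implementation of `core.som.SOMMapper` is not shown here, so the following is only a minimal NumPy sketch of how a rectangular self-organizing map can be trained and queried. It is not the author's code: the function names `train_som` and `predict_bmu`, the exponential decay schedule, the Gaussian neighbourhood and the sample-based initialisation are all assumptions. The parameter names mirror `PARAMETERS_SOM` where possible.

```python
import numpy as np

def train_som(data, kshape=(10, 10), n_iter=150, learning_rate=0.01, seed=0):
    """Train a rectangular SOM; returns a codebook of shape (n_nodes, n_features)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = kshape
    # Grid coordinates of each node, flattened in row-major order.
    grid = np.array([(r, c) for r in range(n_rows) for c in range(n_cols)], dtype=float)
    # Initialise the codebook from randomly chosen samples (one possible choice).
    codebook = data[rng.integers(0, len(data), size=n_rows * n_cols)].astype(float)
    sigma0 = max(kshape) / 2.0
    for t in range(n_iter):
        # Neighbourhood radius and learning rate both decay over the iterations.
        decay = np.exp(-t / float(n_iter))
        sigma, lr = sigma0 * decay, learning_rate * decay
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(np.linalg.norm(codebook - x, axis=1))
            # Gaussian neighbourhood centred on the BMU, measured on the grid.
            grid_dist2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            codebook += lr * h[:, None] * (x - codebook)
    return codebook

def predict_bmu(codebook, data):
    """Return, for each sample, the flat index of its closest node."""
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

rng = np.random.default_rng(1)
toy_data = rng.random((25, 8))  # 25 "articles", 8 features (made up)
toy_codebook = train_som(toy_data, kshape=(4, 4), n_iter=20, learning_rate=0.1)
toy_nodes = predict_bmu(toy_codebook, toy_data)
print(toy_codebook.shape, toy_nodes.shape)
```

As in the output above, the codebook has one row per node and one column per feature, and each article is summarized by the flat index of its best matching unit (BMU).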

In [73]:
articlesNodes = mapper.predict(tfidf)

In [74]:
len(articlesNodes)

25

In [75]:
article_id = 15  # avoids shadowing the built-in id()
print(corpus.iloc[article_id].Title)
print(corpus.iloc[article_id].Body[:150])

Quand l’ONU, gage de paix dans le monde, recourt à des mercenaires
C’est une dérive passée inaperçue, mais qui commence à poser problème : pour défendre ses troupes dans des zones dangereuses, l’ONU a de plus en plus 


In [76]:
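# predict() on the full matrix returns the best matching unit (BMU) of every article
bmu = mapper.predict(tfidf)
# bmu[0] is the BMU of the first article (index 0), not of the article printed above
neuron = kohonen[bmu[0],:]
# indices of the 20 largest weights of that neuron, in decreasing order
best_features = np.argsort(neuron)[::-1][:20]
print("Best matching unit %s" % (bmu,))
print("Best features: %s" % (best_features,))
print("Best frequencies: %s" % (neuron[best_features[:10]],))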

Best matching unit [59  9 93  1  9 63  9 13 40 40  5 20 80 79 39 96 99  5 77 42 75 45 28 80 47]
Best features: [ 86  28 361 413 327 318 284  72 388  52 310 193 210 137 313 157 199 115
  45 326]
Best frequencies: [ 0.161845    0.11118024  0.10218823  0.09936342  0.09782284  0.08851
  0.0871634   0.08255868  0.08180177  0.07786587]
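
To inspect the neuron of one specific article rather than the first one, the BMU array can be indexed by the article's position. The sketch below uses small made-up arrays in place of `kohonen` and `articlesNodes`; the names `toy_kohonen`, `toy_article_nodes` and `article_id` are illustrative.

```python
import numpy as np

# Made-up stand-ins: a codebook with 4 nodes x 6 features and the BMU of 3 articles.
toy_kohonen = np.array([
    [0.1, 0.9, 0.0, 0.3, 0.2, 0.5],
    [0.7, 0.1, 0.6, 0.0, 0.2, 0.1],
    [0.0, 0.2, 0.1, 0.8, 0.4, 0.3],
    [0.3, 0.3, 0.9, 0.1, 0.0, 0.6],
])
toy_article_nodes = np.array([2, 0, 3])

article_id = 1
bmu_of_article = toy_article_nodes[article_id]  # a single node index
neuron = toy_kohonen[bmu_of_article, :]         # weight vector of that node

# Feature indices sorted by decreasing weight, keeping the top 3.
top_features = np.argsort(neuron)[::-1][:3]
print(bmu_of_article, top_features, neuron[top_features])
```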


In [77]:
words[best_features].values.tolist()

['decide',
 'peuvent',
 'la montee',
 'on',
 'plus tot',
 'joue',
 'paris et',
 'coalition',
 'general',
 'sur les',
 'chefs',
 'uvre',
 'est un',
 'mai',
 'electorale',
 'par une',
 'la semaine',
 'passe',
 'devant',
 'en france']

In [82]:
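# articlesNodes and bmu both hold the BMUs of all articles, so their difference
# is zero everywhere and every article is selected
close_articles = np.where(np.abs(articlesNodes-bmu)<1)
print(close_articles)
print("best matching unit %s" % (bmu,))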

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24], dtype=int64),)
best matching unit [59  9 93  1  9 63  9 13 40 40  5 20 80 79 39 96 99  5 77 42 75 45 28 80 47]
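
A possible way to find articles that sit close to a given article on the map is to compare grid positions rather than flat node indices. The sketch below assumes that node indices are a row-major flattening of the `(10, 10)` grid, which `SOMMapper` may or may not use, and takes the first five BMU values from the output above as sample data.

```python
import numpy as np

kshape = (10, 10)
# First five BMU indices from the output above, used as sample data.
article_nodes = np.array([59, 9, 93, 1, 9])
article_id = 1

# Convert flat node indices to (row, column) positions on the grid
# (assumes row-major flattening, which SOMMapper may or may not use).
rows, cols = np.unravel_index(article_nodes, kshape)

# Chebyshev distance on the grid to the reference article's node.
grid_dist = np.maximum(np.abs(rows - rows[article_id]), np.abs(cols - cols[article_id]))

# Articles on the same or an adjacent node, excluding the reference article itself.
close_articles = np.flatnonzero(grid_dist <= 1)
close_articles = close_articles[close_articles != article_id]
print(close_articles)
```

Under this layout, flat indices that differ by 1 are not always neighbours: nodes 9 and 10 sit at opposite ends of two consecutive rows.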


In [83]:
# close_articles is a 1-tuple holding one index array, so the loop body runs once
for i in close_articles:
    print(mapper.predict(tfidf[i,:]))
    print(corpus.iloc[i].Title)
    print(corpus.iloc[i].Body[:200])

[59  9 93  1  9 63  9 13 40 40  5 20 80 79 39 96 99  5 77 42 75 45 28 80 47]
0                          Grèce : ce qui bloque encore
1              Auto à bas prix : l'histoire se poursuit
2          En Espagne, Podemos inquiète les entreprises
3             Vivendi et Orange lorgnent Telecom Italia
4                       L’oud envoûtant d’Anouar Brahem
5                                  Jeanne la Berlinoise
6     Le numéro 2 du Medef cède à la mode du crowdfu...
7          « Nous sommes en guerre, et pour longtemps »
8        Singapour a désormais sa Pinacothèque de Paris
9                                 Sur le pied de guerre
10    La mauvaise culture économique des Français pa...
11                    Le couple à l’épreuve de l’argent
12    La perspective d’une cession de Findus mobilis...
13     Les enjeux des élections législatives en Turquie
14    Etat islamique : la stratégie dévastatrice des...
15    Quand l’ONU, gage de paix dans le monde, recou...
16    Les élections seront-

## 4. Clustering
Clustering is performed on the SOM nodes with the K-means algorithm.

In [259]:
from sklearn.cluster import KMeans

In [260]:
n_clusters = 5
PARAMETERS_KMEANS = {
    "n_clusters": n_clusters,
    "init": 'k-means++',
    "max_iter":100,
    "n_init":10}

In [261]:
km = KMeans(**PARAMETERS_KMEANS)

In [262]:
km.fit(kohonen)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=5, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [263]:
nodesClusters = km.predict(kohonen)
clusters = km.cluster_centers_

In [264]:
print("Top terms per cluster:")
# feature indices of each centroid, sorted by decreasing weight
order_centroids = clusters.argsort()[:, ::-1]
for i in range(n_clusters):
    print("Cluster %d:" % i)
    # the slice :10-1 keeps the 9 highest-weighted feature indices
    print([words[ind] for ind in order_centroids[i,:10-1]])
    print("")

Top terms per cluster:
Cluster 0:
['entre', 'contre', 'moins', 'pays', 'meme', 'par les', 'premiere', 'nouveau', 'peut']

Cluster 1:
['moins', 'premiere', 'contre', 'notamment', 'sur la', 'pays', 'plusieurs', 'par les', 'monde']

Cluster 2:
['sur la', 'moins', 'et de', 'meme', 'monde', 'peut', 'contre', 'plusieurs', 'par les']

Cluster 3:
['contre', 'et le', 'sur la', 'meme', 'que les', 'ne', 'etait', 'moins', 'avant']

Cluster 4:
['que', 'contre', 'par les', 'fin', 'dune', 'leur', 'plusieurs', 'nouveau', 'monde']
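
Because K-means is fitted on the SOM nodes, each article can be assigned a cluster through its best matching unit. The following sketch shows that lookup on made-up data; the toy codebook, the BMU indices and `random_state=0` are assumptions for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up codebook: 16 nodes x 5 features, built from two well-separated groups.
toy_codebook = np.vstack([
    rng.normal(0.0, 0.05, size=(8, 5)),
    rng.normal(1.0, 0.05, size=(8, 5)),
])
toy_article_nodes = np.array([0, 3, 9, 15, 8])  # BMU of five hypothetical articles

toy_km = KMeans(n_clusters=2, init="k-means++", max_iter=100, n_init=10, random_state=0)
node_clusters = toy_km.fit_predict(toy_codebook)  # cluster label of every node

# An article inherits the cluster of its best matching unit.
article_clusters = node_clusters[toy_article_nodes]
print(article_clusters)
```

Fixing `random_state` makes the cluster numbering reproducible between runs, which helps when comparing the top terms of each cluster.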

