# Text analysis (2)

Here we discuss the steps that precede text mining itself, mostly how to extract and prepare the data.
For an introduction to the modeling part and to NLP, I suggest:
- scikit-tutorial part 2
- *Natural Language Processing with Python* by Steven Bird, Ewan Klein and Edward Loper.



## Download files
- command line

In [1]:
%%bash
wget https://en.wikipedia.org/wiki/Blois

--2019-02-27 10:46:56--  https://en.wikipedia.org/wiki/Blois
Resolving en.wikipedia.org (en.wikipedia.org)... 2620:0:862:ed1a::1, 91.198.174.192
Connecting to en.wikipedia.org (en.wikipedia.org)|2620:0:862:ed1a::1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 149068 (146K) [text/html]
Saving to: ‘Blois’

     0K .......... .......... .......... .......... .......... 34%  596K 0s
    50K .......... .......... .......... .......... .......... 68%  752K 0s
   100K .......... .......... .......... .......... .....     100%  983K=0,2s

2019-02-27 10:46:56 (740 KB/s) - ‘Blois’ saved [149068/149068]



- requests

In [2]:
import requests

In [16]:
s = "https://fr.wikipedia.org/wiki/Blois"

In [17]:
resp = requests.get(s)

In [18]:
resp.ok

True

In [19]:
resp.text

'<!DOCTYPE html>\n<html class="client-nojs" lang="fr" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Blois — Wikipédia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Blois","wgTitle":"Blois","wgCurRevisionId":156053817,"wgRevisionId":156053817,"wgArticleId":10310,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Page utilisant une frise chronologique","Page avec coordonnées similaires sur Wikidata","Article géolocalisé en France","Article géolocalisé sur Terre","Article avec modèle Infobox Commune de France","Page utilisant un sommaire limité","Article avec module Population de France","Article à référence nécessaire","Article avec modèle Blason-vill

In [20]:
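# Write the page as text; an explicit encoding avoids platform-dependent defaults.
with open('Blois2.html', 'w', encoding='utf-8') as f:
    f.write(resp.text)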

## Crawl the web

If a page contains JavaScript code (links, dynamic text, ...), it becomes necessary to use dedicated software to "fake" real navigation.
Check:
- selenium
- phantomJS

## Text extraction: analyse html file

### option 1: convert to text (html2text) + regex


In [15]:
import html2text  # third-party package, must be installed first
text = html2text.html2text(resp.text)
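
The "convert to text, then apply a regex" idea can also be sketched with the standard library alone. This is a minimal illustration, not a replacement for a dedicated converter: the `TextExtractor` class, the `html_to_text` helper, and the sample snippet (including its population figure) are made up for the example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Hypothetical helper: strip the markup and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


# Made-up snippet standing in for a downloaded page (the figure is invented).
sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><h1>Blois</h1><p>Population: 12 345 inhabitants</p></body></html>"
)
text = html_to_text(sample)

# Once the page is plain text, a regular expression can pick out a value.
match = re.search(r"Population:\s*([\d ]+)", text)
population = int(match.group(1).replace(" ", "")) if match else None
print(text)
print(population)
```

This approach is fragile: the regex depends on the exact wording of the page, which is why using the page structure is often preferable.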

### option 2: by using the webpage structure

A few more languages are involved:
- HTML
- XML
- CSS

An HTML page is represented as a DOM (Document Object Model) tree.

XPath expressions and CSS selectors provide a way to address particular elements of a webpage. Use the developer tools in Chrome to copy the right path.

Use libraries:
- beautifulsoup
- lxml

In [None]:
//*[@id="mw-content-text"]/div/table[1]/tbody/tr[18]/td/text()
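
To show how such a path selects one element, here is a small sketch using the standard library's `xml.etree.ElementTree` on a made-up, well-formed fragment. The fragment and the `content` id are assumptions for the example. ElementTree only supports a subset of XPath (no `text()`), and real HTML is rarely well-formed XML, so libraries such as lxml or beautifulsoup are the more realistic choice on downloaded pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment mimicking a page's info table.
page = """
<html><body>
  <div id="content">
    <div>
      <table>
        <tr><th>Country</th><td>France</td></tr>
        <tr><th>Region</th><td>Centre-Val de Loire</td></tr>
      </table>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(page)

# Attribute predicates and 1-based positional indexes are supported;
# text() is not, so the text is read from the element's .text attribute.
cell = root.find(".//*[@id='content']/div/table[1]/tr[2]/td")
value = cell.text
print(value)
```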

## Storing the text data

- small data: 

    - python's `shelve` (like a dict, but on disk)
    
- big data: database

    - mongodb: stores many documents as json documents, holds unstructured data easily
    - sqlite: on-drive sql variant
    - sql server

In [21]:
import shelve

In [25]:
sh = shelve.open("results")

In [26]:
sh['a'] = 34
sh['article1'] = "IHOIHOIH "

In [27]:
sh.close()
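
As a complement, here is a minimal sketch of the write-then-read cycle with `shelve`, using a context manager so the shelf is closed automatically. The temporary directory and the stored values are assumptions chosen so that the example leaves no files behind.

```python
import os
import shelve
import tempfile

# Work in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results")

    # The context manager closes the shelf even if an error occurs.
    with shelve.open(path) as sh:
        sh["a"] = 34
        sh["article1"] = "some article text"

    # Reopen the shelf to check that the values were persisted on disk.
    with shelve.open(path) as sh:
        keys = sorted(sh.keys())
        a = sh["a"]
        article = sh.get("article1")

print(keys, a, article)
```

Keys must be strings, while values can be any picklable Python object.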

## Feature extraction

Before any model can be trained, the text must be transformed into an algorithm-friendly representation, i.e. vectors of features.

### Bag of words

One naive approach consists in counting, for each document, the number of occurrences of each word.

|  _    | recession | france  | japan | fuji |
|-------|-----------|---------|-------|------|
| doc1  |  1        | 3       |  0    | 0    |
| doc2  |  4        |  0      | 3     |  1   |
| doc3  |  0        | 0       | 2     | 1    |

scikit-learn provides a dedicated class for that, which operates on the whole data set:

```python
from sklearn.feature_extraction.text import CountVectorizer
```


This creates a potentially very large matrix, with many features, most of which are zero for any given document. The matrix is therefore stored in a sparse format.
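
The counting step itself can be sketched in plain Python. This is only an illustration of the idea on a made-up corpus whose counts match the table above; it uses naive whitespace splitting and a dense list of lists, whereas `CountVectorizer` handles tokenization and returns a sparse matrix.

```python
from collections import Counter

# Toy corpus, made up so that the counts match the table above.
docs = {
    "doc1": "recession france france france",
    "doc2": "recession recession japan recession japan fuji recession japan",
    "doc3": "japan fuji japan",
}

# One Counter per document, using naive whitespace tokenization.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary word.
matrix = [[counts[name][word] for word in vocabulary] for name in docs]

print(vocabulary)
for name, row in zip(docs, matrix):
    print(name, row)
```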

Because larger documents contain more words, it is more informative to store frequencies, that is, to divide each count by the total number of words in the document to obtain term frequencies (tf). One might also want to downscale words that appear in many documents, by multiplying by the inverse of the fraction of documents that contain the word (usually taken in logarithm). This is called “Term Frequency times Inverse Document Frequency” (tf-idf). Both actions are supported by the class:

```python
from sklearn.feature_extraction.text import TfidfTransformer
```
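
To make the arithmetic concrete, here is a hand-rolled sketch using the counts from the table above and the textbook definition idf = log(N / df). This is an assumption made for illustration: as far as I know, `TfidfTransformer` uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math

# Word counts per document (same numbers as the table above).
vocabulary = ["recession", "france", "japan", "fuji"]
counts = [
    [1, 3, 0, 0],  # doc1
    [4, 0, 3, 1],  # doc2
    [0, 0, 2, 1],  # doc3
]
n_docs = len(counts)

# Term frequency: each count divided by the document's total word count.
tf = [[c / sum(row) for c in row] for row in counts]

# Document frequency: number of documents that contain each word.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]

# Inverse document frequency, textbook form log(N / df).
idf = [math.log(n_docs / d) for d in df]

# tf-idf: term frequency weighted by inverse document frequency.
tfidf = [[t * w for t, w in zip(row, idf)] for row in tf]

for word, weight in zip(vocabulary, tfidf[0]):
    print(word, round(weight, 3))
```

In doc1, "france" gets the largest weight: it is frequent in that document and appears in no other.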

The data is then ready to be used to fit a model.

### Other approaches

- tokenization (with python nltk)
- phonetization
- deep learning model
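
As an illustration of the first item, tokenization can be approximated with a single regular expression. This is a deliberately crude sketch (the `tokenize` helper is hypothetical); NLTK's tokenizers handle punctuation, contractions and language-specific rules much more carefully.

```python
import re


def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())


tokens = tokenize("Blois is a city; its château overlooks the Loire.")
print(tokens)
```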