### <span style="color:green"> The Plug!!! </span>

<img src="images/bookcover.jpg" style="width: 400px;"> 

### Brian Carter, Data Scientist, IBM Dublin, Ireland 

#### Talk

* Background
* Cleaning the Data
* Classification
* Clustering
* Topic Modelling 

### Background

* Mining unstructured text from a website, <a href="http://www.pillreports.net/index.php?page=display_pill&id=34892" target="_blank">pillreports.net</a>

  
* Same as any review site, except its focus is on Ecstasy culture

<img src="images/ecstasy.jpg" style="width: 350px;">





### The Idea

* a pill's contents are not known until the time of consumption<br><br>

* review sites may be viewed as a bridge across that knowledge gap<br><br>

* **The Noble Idea** 
    * flag instances where an <font color="red"><strong>identifiable</strong></font> pill is producing <b>undesirable</b> effects

### Getting the data

* Connect to the webpage(s) - easy incremental ID  -  <a href="https://docs.python.org/2/library/urllib2.html" target="_blank"><font color = "green">(urllib2)</font> </a>


* parse the HTML structure into working format - <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank"><font color = "green">(BeautifulSoup)</font></a>


* connect to MongoDB server, create database and save **report** and associated **comments** into two collections - <a href="https://api.mongodb.org/python/current/" target="_blank"><font color = "green">(PyMongo)</font></a>
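
The "easy incremental ID" step can be sketched as follows. This is a minimal Python 3 illustration using only the standard library, not the talk's Python 2 `urllib2` code. The URL pattern comes from the report link above; the function names, the ID range and the timeout are assumptions made for the example, and `fetch` is defined but never called so that nothing touches the network.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address taken from the report link earlier in the slides.
BASE_URL = "http://www.pillreports.net/index.php"


def report_url(report_id):
    """Build the URL of a single report page from its numeric ID."""
    query = urlencode({"page": "display_pill", "id": report_id})
    return "{}?{}".format(BASE_URL, query)


def report_urls(first_id, last_id):
    """Yield one URL per report ID in the inclusive range."""
    for report_id in range(first_id, last_id + 1):
        yield report_url(report_id)


def fetch(url, timeout=10):
    """Download one page and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# The ID range below is arbitrary and only shows the incrementing pattern.
urls = list(report_urls(34890, 34892))
print(urls[-1])
```

Keeping the URL construction separate from the download makes it easy to resume a scrape from the last ID that was saved.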


### Getting the right data

* Each page has the same structure - 3 HTML tables


* The **report** is in table[2] and the **comments** in table[3]


* Not all reports have the same fields
    * <font color = blue> I didn't get out a ruler and measure the <strong>width</strong> and <strong>height</strong> of each pill* </font>


##### For each webpage:

* Using a Python *dictionary*: the 1st column holds the *keys*, the 2nd column the *values*


* complete some basic cleaning *(remove white space, extra HTML tags)*


* dictionary inserted as a document into MongoDB collection

<a href="https://nbviewer.jupyter.org/github/iBrianCarter/pillreports_python/blob/master/2.Web%20Scraping.ipynb" target="_blank"><font color = "green">Scrape Code</font></a>
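
The per-page steps above (two-column table to dictionary, with basic cleaning) might look roughly like this. The sketch swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages on Python 3. The sample HTML, the field names and the colon stripping are invented for illustration and are not taken from the real pages.

```python
from html.parser import HTMLParser


class TwoColumnTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join fragments split by inner tags and collapse extra whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_dict(html):
    """Map first-column text (keys) to second-column text (values)."""
    parser = TwoColumnTableParser()
    parser.feed(html)
    parser.close()
    # Rows without exactly two cells (headers, spacers) are skipped.
    return {row[0].rstrip(":"): row[1] for row in parser.rows if len(row) == 2}


# Invented sample markup; the real report table has many more fields.
SAMPLE = """
<table>
  <tr><td>Name:</td><td>  Blue <b>Star</b> </td></tr>
  <tr><td>Warning:</td><td>no</td></tr>
  <tr><td>Header spanning the table</td></tr>
</table>
"""

report = table_to_dict(SAMPLE)
print(report)
```

A dictionary built this way can be handed to a document store as is, which also copes with reports that do not all share the same fields.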



### Personal Lesson 1 - <font color = red> Character Encoding in Python 2.7  </font>

In [1]:
eng="thank you"
hun="köszönöm"
print ("String Length: ", len(eng), "Type: ", type(eng))
print(hun, "String Length: ", len(hun), "Type: ", type(hun))
#k,s,z,n,m print correctly

('String Length: ', 9, 'Type: ', <type 'str'>)
('k\xc3\xb6sz\xc3\xb6n\xc3\xb6m', 'String Length: ', 11, 'Type: ', <type 'str'>)


In [2]:
eng="thank you"
hun=u"köszönöm"

print(eng, "String Length: ", len(eng), "Type: ", type(eng))
print(hun.encode('ascii', 'replace'), "String Length: ", len(hun),
      "Type: ", type(hun))

('thank you', 'String Length: ', 9, 'Type: ', <type 'str'>)
('k?sz?n?m', 'String Length: ', 8, 'Type: ', <type 'unicode'>)


### str and code-points

* Python 2.7 has two **basestring** types; <font color = green>str</font>  and <font color = green>unicode</font>


* **str** are bytes whereas **unicode** is composed of unicode code-points.


* In the first example the *O-diaeresis* takes two bytes and len() counts the number of bytes


* In the second example the **u** prefix tells the Python interpreter that the string should be represented as **code-points**, so len() counts the correct number of code points. 
    * <font color = red> (In Python 3 this is reversed: Unicode is the default **str** type, and a prefix **b** is required to create a **bytes** object)</font>
    

* In the second example the string was also encoded to ASCII with the **replace** option. ASCII has no entry for the *O-diaeresis* in its lookup table, so each one printed as **?**. 
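
For comparison, here is a minimal Python 3 sketch of the same two measurements. It assumes a Python 3 interpreter, where text is `str` (code points) and encoded data is `bytes`. The word is written with `\u` escapes so the snippet does not depend on the encoding of the source file.

```python
# "köszönöm" spelled with escapes: 8 code points, three of them o-diaeresis.
word = "k\u00f6sz\u00f6n\u00f6m"

# Encoding to UTF-8 turns each o-diaeresis into two bytes.
utf8_bytes = word.encode("utf-8")

print(type(word).__name__, len(word))              # str 8
print(type(utf8_bytes).__name__, len(utf8_bytes))  # bytes 11

# ASCII cannot represent the o-diaeresis, so 'replace' substitutes '?'.
ascii_bytes = word.encode("ascii", "replace")
print(ascii_bytes)                                 # b'k?sz?n?m'
```

The lengths 8 and 11 match the two Python 2.7 cells above. What differs is which type each number belongs to.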

### encode, encoding, Unicode

* The misuse and interchange of the terms (encode, encoding, Unicode) can cause confusion. 


* A **Unicode integer**, or code point, is **encoded** according to a selected **encoding** standard that translates it to bytes


* Bytes are sent and decoded by the relevant encoding to get the code-point and its character representation. 


<img src="images/PR_Images_6.png" style="width: 800px;"> 
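
The code point to bytes to code point round trip can be sketched in a few lines of Python 3. The two encodings shown (UTF-8 and Latin-1) are picked only as examples. The last step decodes with the wrong encoding on purpose, to show where garbled text comes from.

```python
char = "\u00f6"                      # o-diaeresis

# The Unicode integer (code point) behind the character.
code_point = ord(char)
print(code_point, hex(code_point))   # 246 0xf6

# The same code point becomes different bytes under different encodings.
utf8 = char.encode("utf-8")          # b'\xc3\xb6' (two bytes)
latin1 = char.encode("latin-1")      # b'\xf6'     (one byte)

# Decoding with the encoding that produced the bytes restores the character.
round_trip = utf8.decode("utf-8")

# Decoding with a different encoding gives two unrelated characters instead.
mojibake = utf8.decode("latin-1")
print(repr(round_trip), repr(mojibake))
```

The bytes carry no label saying which encoding produced them, so sender and receiver have to agree on it some other way.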



<a href="https://www.youtube.com/watch?v=sgHbC6udIqc" target="_blank"><font color = "green">Great Video: How do I stop the pain</font></a>

### Cleaning the data

* Read data into pandas.DataFrame <br><br>
* clean up the data formats  <a href="https://docs.python.org/2/library/datetime.html" target="_blank"><font color = "green">(datetime)</font> </a>


* tidy integers with textual units (mm)  - regular expressions 


*  split fields into new columns (Report Quality Rating - 3.35 stars, 3 votes) etc.


* geo coding - very messy  *(SoCal = California, USA ; Vic = Victoria, Australia)* 


* determine the language of a report - <a href="https://pypi.python.org/pypi/langdetect/1.0.1" target="_blank"><font color = "green">(langdetect)</font> </a>

    * Python port of a Google created Java library for language detection
    * developed with labelled Wikipedia articles
    * Naive Bayes ruleset, can detect 49 languages with 99% accuracy, minimum 10-12 words depending on the language
    
    
   
<a href="https://nbviewer.jupyter.org/github/iBrianCarter/pillreports_python/blob/master/3.Data%20Cleansing.ipynb" target="_blank"><font color = "green">Cleaning Code</font></a>
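
A few of the cleaning steps listed above can be sketched with the standard library's `re` module. The patterns, the function names and the two-entry alias table are assumptions made for this illustration. The linked notebook may handle more formats and works on DataFrame columns rather than single strings.

```python
import re

# A number followed by "mm", with optional whitespace and any letter case.
MM_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*mm", re.IGNORECASE)

# Text such as "3.35 stars, 3 votes".
RATING_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*stars?,\s*(\d+)\s*votes?", re.IGNORECASE
)

# Hypothetical alias table; a real one would need many more entries.
REGION_ALIASES = {
    "socal": "California, USA",
    "vic": "Victoria, Australia",
}


def parse_mm(text):
    """Return the measurement in millimetres as a float, or None."""
    match = MM_PATTERN.search(text)
    return float(match.group(1)) if match else None


def parse_rating(text):
    """Split a rating string into (stars, votes), or return None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def normalise_region(text):
    """Replace a known abbreviation with its full region name."""
    cleaned = text.strip()
    return REGION_ALIASES.get(cleaned.lower(), cleaned)


print(parse_mm("8mm"), parse_rating("3.35 stars, 3 votes"))
print(normalise_region(" SoCal "))
```

Returning `None` for text that does not match keeps missing values explicit when the results are written back into new columns.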


In [3]:
import pandas as pd
# Raw string so the backslashes in the Windows path are never read as escape sequences
df=pd.read_csv(r"C:\Users\IBM_ADMIN\Desktop\Python_Ireland_Nov15\Python_Ireland_Nov15\slides\images\lang.csv")

In [4]:
df

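| Unnamed: 0 | origin/language | en | nl | tr | pl | ru | af | no | Unknown | All |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | All | 2962 | 32 | 20 | 2 | 2 | 1 | 1 | 1981 | 5001 |
| 1 | USA | 1065 | 0 | 0 | 0 | 0 | 0 | 0 | 986 | 2051 |
| 2 | Australia | 591 | 0 | 0 | 0 | 0 | 0 | 0 | 378 | 969 |
| 3 | Netherlands | 208 | 29 | 0 | 0 | 0 | 0 | 0 | 127 | 364 |
| 4 | England | 223 | 0 | 0 | 0 | 0 | 0 | 0 | 101 | 324 |
| 5 | Unknown | 160 | 0 | 1 | 0 | 0 | 1 | 1 | 118 | 281 |
| 6 | Ireland | 122 | 0 | 0 | 0 | 0 | 0 | 0 | 95 | 217 |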


### Visualising and Exploring the Data

* Using a mixture of <a href="http://matplotlib.org/" target="_blank"><font color = "green">(matplotlib)</font> </a> and  <a href="http://stanford.edu/~mwaskom/software/seaborn/" target="_blank"><font color = "green">(seaborn)</font> </a>


* matplotlib is old and very large, and slow to work with (number of lines of code)


* seaborn is new, not as flexible but very quick, with a focus on statistical plots



* Exploring never ends; it continues when moving into the mining


<a href="https://nbviewer.jupyter.org/github/iBrianCarter/pillreports_python/blob/master/4.Data%20Visualization.ipynb" target="_blank"><font color = "green">Visualisation Code</font></a>

<img src="images/PR_Images_16a.png" style="width: 800px;"> 
<img src="images/PR_Images_16b.png" style="width: 800px;"> 

* 12 & 8 lines of code

<img src="images/PR_Images_17a.png" style="width: 800px;"> 

* 16 lines of code

### Classification 

- Build a simple Naive Bayes Model using <b>Description</b> free text field as input features and <b>Warning</b> as target label


- Filter rows to those in English


- Create different representations of the text (stopwords, stemming, ngrams) - <a href="http://www.nltk.org/" target="_blank"><font color = "green">(nltk)</font> </a>


- Use different vector weighting (binary, TFIDF)


- Apply other algorithms 



<a href="https://nbviewer.jupyter.org/github/iBrianCarter/pillreports_python/blob/master/5.Classification.ipynb" target="_blank"><font color = "green">Classification Code</font></a>
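
To make the first bullet concrete, here is a from-scratch multinomial Naive Bayes with add-one smoothing in plain Python 3. It is a teaching sketch, not the linked notebook's code. The class name, the whitespace tokeniser and the four training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (no stopwords or stemming here)."""
    return text.lower().split()


class TinyNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the log likelihood of every known word.
            score = math.log(doc_count / total_docs)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # words never seen in training carry no evidence
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented descriptions with an invented "Warning" label.
texts = [
    "felt sick and dizzy bad headache",
    "nasty comedown felt sick",
    "clean happy great night",
    "great smooth happy feeling",
]
labels = ["yes", "yes", "no", "no"]

model = TinyNaiveBayes().fit(texts, labels)
print(model.predict("felt sick all night"), model.predict("happy great night"))
```

Working in log space avoids multiplying many small probabilities together, and the add-one term keeps a word seen in only one class from forcing the other class's probability to zero.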


In [5]:
# Raw strings so the backslashes in the Windows paths are never read as escape sequences
classDF=pd.read_csv(r"C:\Users\IBM_ADMIN\Desktop\Python_Ireland_Nov15\Python_Ireland_Nov15\slides\images\class.csv")
resultsDF=pd.read_csv(r"C:\Users\IBM_ADMIN\Desktop\Python_Ireland_Nov15\Python_Ireland_Nov15\slides\images\score.csv")

In [6]:
classDF

| Unnamed: 0 | Warning | All Rows | All | English Rows | English |
|---|---|---|---|---|---|
| 0 | Yes | 3184 | 0.64 | 1888 | 0.65 |
| 1 | No | 1817 | 0.36 | 1008 | 0.35 |


In [7]:
resultsDF

| Unnamed: 0.1 | Unnamed: 0 | # Features | 1.Model | 2.Vector | 3.Train Acc. | 4.Train std. | 5.Test Score |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 2785 | NB | Freq-2 | 0.722089 | 0.014371 | 0.725862 |
| 1 | 23 | 16847 | SGD | TFIDF-6 | 0.721546 | 0.023935 | 0.705172 |
| 2 | 0 | 2905 | NB | Freq-1 | 0.720362 | 0.016211 | 0.732759 |
| 3 | 17 | 11819 | SGD | TFIDF-3 | 0.714644 | 0.023743 | 0.7 |
| 4 | 19 | 16713 | SGD | TFIDF-4 | 0.711185 | 0.021285 | 0.706034 |
| 5 | 21 | 7417 | SGD | TFIDF-5 | 0.704266 | 0.026029 | 0.693103 |
| 6 | 11 | 16847 | SGD | Freq-6 | 0.70082 | 0.027605 | 0.700862 |
| 7 | 13 | 2905 | SGD | TFIDF-1 | 0.69622 | 0.018576 | 0.681034 |
| 8 | 9 | 7417 | SGD | Freq-5 | 0.696219 | 0.034046 | 0.668966 |
| 9 | 7 | 16713 | SGD | Freq-4 | 0.695078 | 0.031401 | 0.668103 |


### Understanding of classification boundary

* Select top predictive features

* Visualise 

<img src="images/img5_6a.png" style="width: 800px;"> 

<img src="images/PR_Images_25.png" style="width: 800px;"> 

### Clustering

- Using the <b>User Report</b> free text field as input features


- Create a visualisation of the points based on the features' PCA representation.


- Filter rows to those where the language is English and by Country


- Using part-of-speech tagging, visualise descriptive words (adjectives and nouns) in a word cloud


<a href="https://nbviewer.jupyter.org/github/iBrianCarter/pillreports_python/blob/master/6.Clustering%20%26%20PCA.ipynb" target="_blank"><font color = "green">Clustering Code</font></a>



### PCA

* Create a binary occurrence vector representation of stemmed words (Porter) 


* Normalise the vectors with TF-IDF


* perform PCA and return 2 components 


* 3.5% of total variance explained 


* Use the first two principal components to create a scatter plot and overlay some columns with nominal categories

<img src="images/PR_Images_27.png" style="width: 1200px;"> 
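
The pipeline above (binary occurrence, TF-IDF weighting, two principal components, explained variance) can be sketched in plain Python 3. Several things here are assumptions: stemming is skipped, the idf uses one common smoothed form, and power iteration with deflation stands in for the SVD-based PCA that a numerical library would normally provide. The four sample documents are invented.

```python
import math
import random


def binary_occurrence(docs):
    """Return (vocabulary, matrix) with 1.0 where a word occurs in a document."""
    token_sets = [set(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*token_sets))
    matrix = [[1.0 if w in tokens else 0.0 for w in vocab] for tokens in token_sets]
    return vocab, matrix


def tfidf_normalise(rows):
    """Weight each column by a smoothed idf, then scale rows to unit length."""
    n_docs = len(rows)
    doc_freq = [sum(col) for col in zip(*rows)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1.0 for df in doc_freq]
    weighted = []
    for row in rows:
        scaled = [x * w for x, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in scaled)) or 1.0
        weighted.append([x / norm for x in scaled])
    return weighted


def pca(rows, n_components=2, iterations=500, seed=0):
    """Return (scores, explained variance ratios) using power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centred = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centred data (d x d).
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_variance = sum(cov[i][i] for i in range(d))
    rng = random.Random(seed)
    components, ratios = [], []
    for _component in range(n_components):
        v = [rng.random() for _ in range(d)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _step in range(iterations):
            w = [sum(c * x for c, x in zip(cov_row, v)) for cov_row in cov]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left in any direction
            v = [x / norm for x in w]
        # Rayleigh quotient: the variance captured along direction v.
        eigenvalue = sum(vi * sum(c * x for c, x in zip(cov_row, v))
                         for vi, cov_row in zip(v, cov))
        components.append(v)
        ratios.append(eigenvalue / total_variance if total_variance else 0.0)
        # Deflation: remove the found direction before looking for the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centred]
    return scores, ratios


docs = [
    "clean happy pill great night",
    "strong clean pill happy",
    "bad pill felt sick",
    "felt sick bad headache",
]
vocab, occurrence = binary_occurrence(docs)
scores, ratios = pca(tfidf_normalise(occurrence))
print([round(r, 3) for r in ratios])
```

Each row of `scores` is the (x, y) position of one report in the scatter plot, and the sum of `ratios` is the share of total variance that the two components explain.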

### Fun Word Cloud

* Using POS tagger extract adjectives and nouns

* Create a word cloud sized by word count and segmented by drug


<img src="images/PR_Images_30.png" style="width: 800px;"> 
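
The counting behind such a word cloud can be sketched as below, assuming the tokens are already tagged with Penn Treebank tags, which is the `(word, tag)` format returned by `nltk.pos_tag`. The drug labels and tagged tokens are invented, and the tagging and drawing steps are left out.

```python
from collections import Counter, defaultdict


def descriptive_word_counts(tagged_reports):
    """Count adjectives (JJ*) and nouns (NN*) separately for each drug.

    tagged_reports: iterable of (drug, [(word, tag), ...]) pairs.
    """
    counts = defaultdict(Counter)
    for drug, tagged_tokens in tagged_reports:
        for word, tag in tagged_tokens:
            # JJ, JJR, JJS are adjectives; NN, NNS, NNP, NNPS are nouns.
            if tag.startswith(("JJ", "NN")):
                counts[drug][word.lower()] += 1
    return counts


# Invented, hand-tagged tokens standing in for tagger output.
tagged_reports = [
    ("MDMA", [("Clean", "JJ"), ("buzz", "NN"), ("was", "VBD"), ("clean", "JJ")]),
    ("Unknown", [("bad", "JJ"), ("headache", "NN"), ("lasted", "VBD")]),
]

word_counts = descriptive_word_counts(tagged_reports)
for drug, counter in word_counts.items():
    print(drug, counter.most_common(2))
```

Each per-drug `Counter` gives the word frequencies that set the font sizes in one segment of the cloud.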

### Topic Modelling 

- Late Addition (not in the book)



- Apply Topic Modelling to the data using  <a href="https://radimrehurek.com/gensim/" target="_blank"><font color = "green">(gensim)</font> </a>


- Determine if there are latent topics in the **User Report** column



- New port of **LDAvis** R package for interactive topic model visualization  <a href="https://pypi.python.org/pypi/pyLDAvis" target="_blank"><font color = "green">(pyLDAvis)</font> </a>


- **JUMP TO NOTEBOOK**

### <span style="color:green"> Questions? Comments? Ridicule!!! </span>

<img src="images/bookcover.jpg" style="width: 400px;"> 