# PHASE 1: Platform Classification using Supervised Learning Methods - Building Training and Test datasets (part 1)

This script tests supervised learning methods for classifying platforms (i.e. the resources discussed on freeCodeCamp social media).

The script has mainly been used to start building a labelled dataset, which did not previously exist, that could eventually be used for training and validation. At the same time it gives some idea of the feasibility of an ML approach to classification, and allows a comparison of methodologies that could be used in a full implementation later on.

## METHOD

### The Categories

The categories are a construct meant to reflect the main types of material consulted and shared by users in the main file. However, they were identified *after* applying the selection/inclusion rules for resources (see next), so they are biased by that procedure.

The categories are curated and can be consulted in the following folder:

* https://github.com/evaristoc/fCC_R3_DataAnalysis/tree/development/docs

### Data Preparation: The primer labelled file

The first source of labelled data was a set of hard-coded conditional rules based on regex or URL domain words (e.g. "api" for api, ["devs", "docs", etc.] for docs, ["forum", "chat", etc.] for community). The script with the hard-coded rules might confuse the reader and is therefore not attached to this script. There were also rules to pre-select the kind of platforms to be treated. Some of the rules were:

* gitter, memes, youtube, github, codepen and freecodecamp.com were not included. The full exclusion list is longer.
* Also excluded were links ending in an image format, like \*.gif, or pointing to scripts (e.g. \*.js).
* Of those that passed, links referring to topics like javascript, react, angular and others were included. The full inclusion list is longer.

The accuracy of the hard-coded rules ended up at around 60%, measured by re-assigning classes based on personal judgement and comparing the resulting classifications.
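As a rough illustration of what such rules can look like, here is a minimal sketch. Only the keywords quoted above come from the original rules; the function name `rule_based_class`, the table layout and the `'---'` fallback label are assumptions made for this example, and the real script (not attached) used longer lists.

```python
import re

# Hypothetical rule tables: only the keywords quoted in the text are real,
# the structure and the class names are illustrative.
EXCLUDED_DOMAINS = ("gitter", "youtube", "github", "codepen", "freecodecamp.com")
EXCLUDED_SUFFIXES = (".gif", ".js")
CLASS_KEYWORDS = {
    "api": ["api"],
    "docs": ["devs", "docs"],
    "community": ["forum", "chat"],
}


def rule_based_class(url):
    """Return a class label, None if the url is excluded, or '---' if no rule fires."""
    url = url.lower()
    if url.endswith(EXCLUDED_SUFFIXES) or any(d in url for d in EXCLUDED_DOMAINS):
        return None
    # split the url into words so that 'api' does not match inside 'rapid'
    words = set(re.split(r"[^a-z0-9]+", url))
    for label, keywords in CLASS_KEYWORDS.items():
        if words.intersection(keywords):
            return label
    return "---"


for url in ["api.jquery.com", "docs.mongodb.com", "forum.example.org",
            "i.imgur.com/cat.gif", "example.org"]:
    print(url, "=>", rule_based_class(url))
```

Rules of this kind are cheap to write, but they only see the URL, which may partly explain why the revised labels disagreed with them so often.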

The data used for classification were extracted by a bot from the main page of each resource and complemented with existing Wikipedia text about the resource. Data were not available for every resource, and the text length and reliability of the information varied.

The name of the file with the first revised classifications is `primer_classes_rev.csv`. This file was used as primer for further classifications.


### Analysis

The supervised methods tested so far were **Multinomial Naïve Bayes (MNB)** and **Decision Trees (DT)**. They were selected for their simplicity and expected robustness compared to more complex methods; a simple, robust method was preferred because of the small amount of data available. Both are also frequently used for document classification (https://en.wikipedia.org/wiki/Document_classification). For this case I also prefer a classifier that can cope with an assumed nonlinear distribution.

The incremental construction worked as follows: use a small training dataset to predict the classes of unobserved records with either MNB or DT; verify the classes of a sample of records by inspecting the resources online and assign each a class (personal judgement); evaluate accuracy; and finally enlarge the training dataset with the newly annotated data to classify the records that were not part of the sample. The process was then repeated until the training dataset reached an arbitrary size.
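The loop described above can be sketched roughly as follows. This is an illustration only: the texts and labels are invented, `manual_label` stands in for the manual inspection of each resource, and `SAMPLE_SIZE` is an arbitrary choice. The real rounds in this notebook were run by hand, one cell at a time.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only.
train_text = ["blog posts rants opinion", "tutorial course learn lessons",
              "blog personal posts", "learn course training"]
train_label = ["blog", "learn", "blog", "learn"]
pool_text = ["my personal blog", "online course to learn",
             "opinion posts", "training lessons"]
# stands in for the manual check of each resource online
manual_label = {"my personal blog": "blog", "online course to learn": "learn",
                "opinion posts": "blog", "training lessons": "learn"}

SAMPLE_SIZE = 2
while pool_text:
    # refit the vocabulary and the classifier on the current training set
    vectorizer = CountVectorizer()
    clf = MultinomialNB().fit(vectorizer.fit_transform(train_text), train_label)
    predicted = clf.predict(vectorizer.transform(pool_text))

    # annotate a sample by hand and score the predictions against it
    sample = pool_text[:SAMPLE_SIZE]
    truth = [manual_label[t] for t in sample]
    accuracy = sum(p == t for p, t in zip(predicted, truth)) / len(sample)
    print(len(train_text), "training records, accuracy on sample:", accuracy)

    # grow the training set; the rest of the pool is classified in the next round
    train_text, train_label = train_text + sample, train_label + truth
    pool_text = pool_text[SAMPLE_SIZE:]
```

Swapping `MultinomialNB()` for `sklearn.tree.DecisionTreeClassifier()` gives the DT variant of the same loop.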

The analyses were very simple: accuracy and precision were calculated in spreadsheets by comparing the predicted classes against the assigned classes at each iteration.
**Be aware that** because part of the work was done in spreadsheets rather than in a script, the procedure is unfortunately not fully recorded and some steps are missing. Not all the files are provided either.
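For readers who would rather script this step, the spreadsheet comparison amounts to something like the sketch below. The helper name `per_class_report` and the label lists are invented for illustration; the original calculations were not recorded in code.

```python
from collections import Counter


def per_class_report(assigned, predicted):
    """Accuracy plus per-class precision and recall from two parallel label lists."""
    hits = Counter(a for a, p in zip(assigned, predicted) if a == p)
    n_assigned, n_predicted = Counter(assigned), Counter(predicted)
    accuracy = sum(hits.values()) / len(assigned)
    report = {}
    for label in sorted(set(assigned) | set(predicted)):
        # precision: share of the predictions for this class that were right
        precision = hits[label] / n_predicted[label] if n_predicted[label] else 0.0
        # recall: share of the records of this class that were found
        recall = hits[label] / n_assigned[label] if n_assigned[label] else 0.0
        report[label] = (precision, recall)
    return accuracy, report


assigned = ["blog", "blog", "docs", "learn", "blog", "docs"]
predicted = ["blog", "blog", "blog", "learn", "blog", "learn"]
accuracy, report = per_class_report(assigned, predicted)
print("accuracy:", accuracy)
for label, (precision, recall) in report.items():
    print(label, "precision:", precision, "recall:", recall)
```

Keeping this in the notebook would also leave a record of each iteration, which the spreadsheets did not.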

## RESULTS so far...

The accuracy of the classification has been between 50% and 60%, better for DT than for MNB. This was acceptable for this phase and amount of data, and very close to the hard-coded rules, but probably insufficient for a stand-alone implementation without manual supervision. Precision for certain classes has also been analysed and was relatively high for a few of them. Recall seemed to be very low at all steps.

Another important result was that MNB tended to be biased towards the most numerous categories. Some tuning of the alpha parameter of MNB was tried, but the results at this stage were not useful.


## OBSERVATIONS

One thing worth mentioning is that the classification is becoming *highly unbalanced*.

One reason is that I am using my own judgement for the classification: although some platforms are easy to classify, others fall between two classes.

The other reason is that the users' selections did in fact fall mostly on certain types of resources. Blogs, for example, make up a large part of the resources that users consult and mention. This also has to do with the actual distribution of platform types on the web: it is very likely that there are far more blogs than online learning platforms, so even if users wanted to find more learning platforms than blogging sites, that would not be possible because it would not reflect the actual distribution of site types on the Internet.
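One simple way to quantify the imbalance is to count the labels and compute the accuracy of always predicting the most frequent class. The sketch below uses invented labels, not the project data; the majority baseline is a useful yardstick when reading any accuracy figure on unbalanced classes.

```python
from collections import Counter

# Invented labels, for illustration only.
labels = ["blog"] * 6 + ["learn"] * 2 + ["docs", "api"]

counts = Counter(labels)
total = len(labels)
for label, n in counts.most_common():
    print(label, n, round(n / total, 2))

# accuracy of always predicting the most frequent class
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total
print("majority baseline:", majority_label, baseline_accuracy)
```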

Another challenge of this project has been that these kinds of resources might have been used differently in each Gitter channel. Some channels might rely more on guides, while others were busier consulting and sharing frameworks or APIs.

**Be also aware that** the code below does not follow a strict, validated methodology. One reason is the small amount of data available when the tests started (10/7/2017). This is also a preliminary test of a methodology.

## Main Python libraries

In [None]:
import os, sys, pathlib
from IPython.display import display, Math, Latex #also '%%latex' magic command
import collections, itertools, operator, re, copy, datetime
import urllib, urllib.request, urllib.parse, dns, ipwhois
import pickle, json, csv, zipfile
import math, random, numpy, scipy, pandas
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import bs4
import nltk, sklearn

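import config  # assumed to be a local settings module (not attached) defining `anacondadir` and `directory`
nltk.data.path.append(config.anacondadir+'nltk_data') #an unfortunate hack for now... need to create a relative link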

In [None]:
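import importlib.util
# find_spec returns None when the module is not installed
# (the `imp` module used previously was removed in Python 3.12)
found = importlib.util.find_spec('bs4') is not None

found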

### References:
* https://stackoverflow.com/questions/82831/how-do-i-check-whether-a-file-exists-using-python
* https://stackoverflow.com/questions/14050281/how-to-check-if-a-python-module-exists-without-importing-it
* http://www.dnspython.org/examples.html
* https://stackoverflow.com/questions/24580373/how-to-get-whois-info-by-ip-in-python-3

* http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* https://web.stanford.edu/class/cs124/lec/naivebayes.pdf

## Data Preparation 1 : primer TRAINING DATASET created using hard coded rules over Gitter HelpBackEnd chatroom (Jun-16 / Mar-17)

### LOADING primer AND ADDING AND PARSING BOT AND WIKIPEDIA TEXT DATA

In [None]:
directory = config.directory

In [None]:
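# `backendclass` is the {platform: class} dict produced by the hard-coded rules script (not attached)
if not pathlib.Path(directory+'primer_classes_rev.csv').is_file():
    with open(directory+'primer_classes_rev.csv','w', newline='') as outfile: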
        csvfile = csv.writer(outfile)
        csvfile.writerow(['platform','class'])
        for k,v in backendclass.items():
            csvfile.writerow([k,v])

In [None]:
if pathlib.Path(directory+'primer_classes_rev.csv').is_file():
    pd_primerclass = pandas.read_csv(directory+'primer_classes_rev.csv')

In [None]:
pd_primerclass.head()

In [None]:
#prepare this one to save data into the pandas dataset
#a merge would have been MUUUUUUUUUCH better -- change this for a merge instead!
def getting_treated_links(data, filename):
    platforms = set(data['platform'])
    with open(directory+filename+'_platforms_data.pkl','rb') as infile:
        botdata = pickle.load(infile)
    for k1, record in botdata.items():
        if k1 not in platforms:
            continue
        row = data['platform'] == k1
        for field in ('title', 'description', 'keywords', 'htext', 'params'):
            value = record[field]
            if value in (-1, 0, None):
                # the bot marks missing values with -1, 0 or None
                value = ''
            elif field == 'params':
                # params is a collection of strings; store it comma-separated
                value = ",".join(value)
            data.loc[row, field] = value
    return data
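The comment at the top of this cell suggests replacing the loop with a merge. A minimal sketch of that idea is shown below, under the assumption that the pickled bot data is a dict of dicts keyed by platform with the five fields used above. The toy `data` and `botdata` are invented, and this is not the code that produced the results in this notebook.

```python
import pandas

# Toy stand-ins for the labelled frame and the pickled bot data.
data = pandas.DataFrame({"platform": ["a.com", "b.org"], "newclass": ["blog", "docs"]})
botdata = {
    "a.com": {"title": "A blog", "description": -1, "keywords": None,
              "htext": "hello", "params": ["q", "page"]},
    "c.net": {"title": "not labelled", "description": "x", "keywords": "y",
              "htext": "z", "params": 0},
}

fields = ["title", "description", "keywords", "htext", "params"]
bot = pandas.DataFrame.from_dict(botdata, orient="index")[fields]
# -1, 0 and None mark missing values in the bot output
bot = bot.apply(lambda col: col.map(lambda v: "" if v in (-1, 0, None) else v))
# params is a list of strings; store it comma-separated
bot["params"] = bot["params"].map(lambda v: ",".join(v) if v else "")
bot = bot.rename_axis("platform").reset_index()

# a left merge keeps every labelled platform, crawled or not
merged = data.merge(bot, on="platform", how="left").fillna("")
print(merged)
```

Compared with the row-by-row loop, the merge touches each frame once and makes the handling of platforms missing from the bot data explicit.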


In [None]:
pd_primerclass.query("'techboyzzz.wordpress.com' in platform")

In [None]:
pd_primerclass[pd_primerclass['platform'] == 'techboyzzz.wordpress.com']['newclass']

In [None]:
pd_primerclass['title'] = ''
pd_primerclass['description'] = ''
pd_primerclass['keywords'] = ''
pd_primerclass['htext'] = ''
pd_primerclass['params'] = ''

pb_platformsdata1 = getting_treated_links(pd_primerclass, 'helpbackend1')

In [None]:
pb_platformsdata1

In [None]:
pb_platformsdata1.loc[pb_platformsdata1['platform'] == 'techboyzzz.wordpress.com',:]

In [None]:
def wikipediasearch(platform):
    title = ''
    while True:
        url = 'https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srsearch='+platform
        req = urllib.request.Request(url)
        resp = urllib.request.urlopen(req)
        respData = resp.read()
        r = json.loads(respData.decode("utf-8"))
        if 'error' in list(r.keys()):
            return title
        if r['query']['search'] != []:
            break
        elif r['query']['search'] == [] and len(platform.split('.')) > 2:
            platform = '.'.join(platform.split('.')[1:])
        elif r['query']['search'] == [] and len(platform.split('.')) <= 2:
            print(platform, ' not found in wikipedia')
            break
    for i,t in  enumerate(r['query']['search']):
        if set(t['title'].lower().replace('.', ' ').split(' ')).intersection(set(platform.split('.'))):
            title = t['title']
            break
        elif set([''.join(t['title'].lower().replace('.', ' ').split(' '))]).intersection(set(platform.split('.'))):
            title = t['title']
            break
    
    print(platform, title)
    return title
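The search function above builds its URL by string concatenation and retries with shorter domain names when nothing is found. The two ideas can be isolated as in the following sketch. The helper names `search_url` and `search_candidates` are made up for this example and no request is actually sent; `urllib.parse.urlencode` is used so that unusual characters in a platform name are percent-encoded.

```python
import urllib.parse

API = "https://en.wikipedia.org/w/api.php"


def search_url(platform):
    """Build the search query with the platform name percent-encoded."""
    params = {"action": "query", "list": "search", "format": "json", "srsearch": platform}
    return API + "?" + urllib.parse.urlencode(params)


def search_candidates(platform):
    """Yield the domain, then the same domain with leading labels stripped, down to two labels."""
    parts = platform.split(".")
    while True:
        yield ".".join(parts)
        if len(parts) <= 2:
            break
        parts = parts[1:]


for candidate in search_candidates("blog.example.co.uk"):
    print(search_url(candidate))
```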

In [None]:
def wikipediaextract(title):
    title = title.replace(' ', '%20')
    url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&titles='+title
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    r = json.loads(respData.decode("utf-8"))
    #print(r)
    return list(r['query']['pages'].values())[0]['extract']
    


In [None]:
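def souping(extract):
    # name the parser explicitly to avoid bs4's "no parser specified" warning
    soup = bs4.BeautifulSoup(extract, 'html.parser')
    paragraphs = soup.find_all('p')
    # some extracts contain no <p> element at all
    text = paragraphs[0].text if paragraphs else ''
    print(text)
    return text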

In [None]:
pd_primerclass['wiki'] = ''

def getting_wikipedia(data):
    for plt in data['platform']:
        title = wikipediasearch(plt)
        print(title, ' in getting wikipedia')
        if title == '':
            data.loc[data['platform'] == plt,'wiki'] = ''
            continue
        extract = wikipediaextract(title)
        wiki = souping(extract)
        #print(wiki)
        data.loc[data['platform'] == plt,'wiki'] = wiki
        #data.loc[data['platform'] == plt,'wiki'] = 1

In [None]:
getting_wikipedia(pd_primerclass)

In [None]:
pd_primerclass.head(10)

In [None]:
#test = 'heroku.com'
#reg = []
#for rdata in dns.resolver.query(test):
#    reg.append(rdata)

In [None]:
#for rd in reg:
#    obj = ipwhois.IPWhois(rdata)
#    res=obj.lookup()
#    print(res)

In [None]:
#pdbackendclass['alltext'] = pdbackendclass['wiki'] + ' ' + pdbackendclass['title'] + ' ' + pdbackendclass['description'] + ' ' + pdbackendclass['keywords'] + ' ' + pdbackendclass['htext']
#pdbackendclass

In [None]:
def datapreparation(data):
    
    usual_stopwords = nltk.corpus.stopwords.words('english')
    other_words = ["re", "fm", "tv", "la", "al", "ben", "aq", "ca", "can", "can'", "can't", "cant", "&"]
    punctuation = ["\\","/", "|","(",")",".",",",":","=","{","}","==", "===","[","]","+","++","-","--","_","<",">","'","''","``",'"',"!","!=","?",";"]
    wtbr = usual_stopwords + other_words + punctuation
    
    pattern01 = re.compile(r'[^a-z0-9]', flags=re.IGNORECASE)
    pattern02 = re.compile(r'\d+', flags=re.IGNORECASE)
    pattern03 = re.compile(r'\w$', flags=re.IGNORECASE)
    
    for plt in data['platform']:
        row = data['platform'] == plt
        textlist = ['']
        # collect lower-cased alphanumeric tokens from each text field the bot filled in
        for field in ('description', 'keywords', 'title', 'htext'):
            value = data.loc[row, field].values[0]
            if value not in ('', None, 'noinformationfound', 'errorreachingpage'):
                textlist = textlist + re.sub(pattern01, ' ', value.lower()).split(' ')

        for p in data.loc[row, 'params'].values[0].split(','):
            allpwds = re.sub(pattern01, ' ', p.lower()).split(' ')
            textlist = textlist + allpwds

        #print(set(textlist))
        
        text = ''
            
        for e in set(textlist):
            #assert type(e).__name__ == str, type(e)
            # skip empty tokens, tokens starting with digits, stopwords and punctuation
            if e not in ('', ' ') and not re.match(pattern02, e) and e not in wtbr:
                #print(e)
                if e in ['rants', 'rant']:
                    e = 'blog'
                text = text + ' ' + e
                    
        data.loc[data['platform'] == plt, 'alltext'] = text

In [None]:
#pdbackendclass['alltext'] = ''
#pdbackendclass.head()

In [None]:
pd_primerclass['alltext'] = ''
datapreparation(pd_primerclass)

In [None]:
pd_primerclass.head(10)

## Data Preparation 2 : TEST DATASET with resources of Gitter HelpFrontEnd chatroom (Jun-16 / Mar-17)

### LOADING TEST FILE

In [None]:
with open(directory+'test_treateddata_links.pkl','rb') as infile:
    test = pickle.load(infile)
len(test)


In [None]:
test[list(test.keys())[3]]

In [None]:
pd_test = pandas.DataFrame.from_dict(test)
pd_test = pd_test.transpose()
pd_test
pd_test = pd_test.reset_index()
pd_test = pd_test.rename(columns={'index':'platform'})
pd_test.head(10)

### EXTRACTING FROM TEST DATASET THE ALREADY CLASSIFIED PLATFORMS FOUND IN primer DATASET

It is worth mentioning that when a classifier was run only over the duplicated records found in the test dataset, the classification was almost, if not completely, perfect (not shown). Removing them from the test dataset was required to avoid this overfitting effect.
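The idea can be shown with plain lists before turning to the DataFrame version in the next cells. The platform names below are arbitrary examples; the final assertion is the kind of sanity check that guards against training records leaking into the test set.

```python
train_platforms = ["heroku.com", "medium.com", "mongodb.com"]
test_platforms = ["medium.com", "example.org", "heroku.com", "example.net"]

seen = set(train_platforms)
# keep only the test platforms that never appear in the training data
test_without_duplicates = [p for p in test_platforms if p not in seen]
print(test_without_duplicates)

# the two sets must be disjoint before any evaluation
assert not seen.intersection(test_without_duplicates)
```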

In [None]:
pd_primerclass['platform'].values

In [None]:
pd_primerclass[['platform','wiki']].head(10)

In [None]:
pd_testwodup1 = pd_test.loc[~pd_test.platform.isin(pd_primerclass['platform'])].copy()  # copy so later column assignments do not act on a view
pd_testwodup1.head(10)

In [None]:
def paramsintostr(x):
    #print(x)
    if x is None:
        return ''
    return ','.join(list(x))

In [None]:
pd_testwodup1['params'] = pd_testwodup1['params'].apply(paramsintostr)

In [None]:
pd_testwodup1.head(10)

In [None]:
getting_wikipedia(pd_testwodup1)

In [None]:
pd_testwodup1.head(10)

In [None]:
datapreparation(pd_testwodup1)
pd_testwodup1.head(10)

## FIRST ROUND

### Correcting some classes after manual classification (affected by data formatting in spreadsheets)

In [None]:
pd_primerclass.loc[pd_primerclass.newclass.isin(['learn|tutorial|course|training|',
       'learn|tutorial|course|training| tips|example',
       'learn|tutorial|course|training|tips|example']), 'newclass'] = 'learn|tutorial|course|training| tips|example'

In [None]:
pd_primerclass.loc[pd_primerclass.newclass.isin(['(text )?editor|interpreter|repl', '(text)?editor|interpreter|repl']), 'newclass'] = '(text )?editor|interpreter|repl'

### primer DATA MODELLING - VECTOR MODEL 1

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect_r1 = CountVectorizer(ngram_range=(1,2))
X_primercounts = count_vect_r1.fit_transform(pd_primerclass.alltext)
X_primercounts.shape
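To make `ngram_range=(1,2)` more concrete, the sketch below hand-rolls the features it produces for one string: every token plus every pair of adjacent tokens. It approximates `CountVectorizer` with its default tokenizer (lower-casing, tokens of two or more word characters) and is not a replacement for it.

```python
import re

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default token pattern


def unigrams_and_bigrams(text):
    tokens = TOKEN.findall(text.lower())
    # a bigram is a pair of adjacent tokens joined by a space
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


print(unigrams_and_bigrams("learn to code a blog"))
```

Note that `alltext` is assembled from a set of words, so its word order is arbitrary; the bigram features may therefore carry little real signal here.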

### Assigning MNB Classifier Class Reduction to 4 instead of 10

This section was added later to check how reducing the number of classes would affect the MNB classification. One aspect I tried to tackle here was the **unbalanced classes**. A few interesting observations were made, but nothing that could be used to largely improve the results.

The code is kept commented out in case the reader wants to try it.

In [None]:
#pd_primerclass['mnb_class'] = pd_primerclass['newclass']
#pd_primerclass.loc[pd_primerclass.mnb_class.isin(['(text )?editor|interpreter|repl', '---',
#       'cloud|platform|service', 'community|support|people|forum',
#       'design|galler|template|theme',
#       'manual|guide|docs',
#       'on?(-|\\s)?demand|business|compan(y|ies)|enterprise',
#       'searchtools', 'shop|commerce']), 'mnb_class'] = 'other'

#pd_primerclass.loc[pd_primerclass.mnb_class == 'other', 'mnb_class'].describe

### NAIVE BAYESIAN CLASSIFICATION 1

Main Reference:
* http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* https://stats.stackexchange.com/questions/99667/naive-bayes-with-unbalanced-classes
* http://www.cs.waikato.ac.nz/~eibe/pubs/FrankAndBouckaertPKDD06new.pdf
* coincidentally! -> http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#sphx-glr-auto-examples-applications-plot-out-of-core-classification-py
* http://www.programcreek.com/python/example/84841/sklearn.feature_extraction.text.CountVectorizer
* http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html (NOTE: this is NOT the vectorizer, but a NORMALIZER!!)

IMPORTANT:

About the smoothing prior parameter $\alpha$ in Multinomial Naive Bayes classification:
* https://stats.stackexchange.com/questions/108797/in-naive-bayes-why-bother-with-laplacian-smoothing-when-we-have-unknown-words-i
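As a reminder of what $\alpha$ does, the sketch below computes additively smoothed word likelihoods, $(N_{cw} + \alpha) / (N_c + \alpha |V|)$, for a single class. The token counts are invented and the helper `smoothed_likelihoods` is written for this example only; it mirrors the estimate described in the scikit-learn documentation for `MultinomialNB` but is not its implementation.

```python
from collections import Counter


def smoothed_likelihoods(class_tokens, vocabulary, alpha):
    """P(word | class) with additive smoothing: (count + alpha) / (total + alpha * |V|)."""
    counts = Counter(class_tokens)
    denominator = len(class_tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


vocabulary = ["blog", "course", "api"]
blog_tokens = ["blog", "blog", "blog", "course"]  # invented counts

for alpha in (1.0, 0.1):
    probs = smoothed_likelihoods(blog_tokens, vocabulary, alpha)
    print(alpha, {w: round(p, 3) for w, p in probs.items()})
```

A smaller $\alpha$ gives unseen words (here "api") a smaller share of the probability mass, so the model trusts the observed counts more.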

Some interesting definitions?:
* http://scikit-learn.org/stable/modules/multiclass.html
* https://stackoverflow.com/questions/20461165/how-to-convert-pandas-index-in-a-dataframe-to-a-column

A search query with interesting results:
* "sklearn.feature_extraction.text.CountVectorizer normalization"

### Normalize Counts for MNB

The normalized data was used for testing but it actually was counterproductive.

A commented line is left for any normalization made if the reader is interested.

In [None]:
#http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
#normalized_X_trainprimer_counts = sklearn.feature_extraction.text.TfidfTransformer(norm='l2').fit_transform(X_trainprimer_counts)

### Model

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf_nb_r1 = MultinomialNB(alpha=0.1).fit(X_primercounts, pd_primerclass.newclass)

### Classification

In [None]:
X_test1counts = count_vect_r1.transform(pd_testwodup1.alltext)
#normalized_X_testround1_counts = sklearn.feature_extraction.text.TfidfTransformer(norm='l2').fit_transform(X_testround1_counts)
predicted_nb_r1 = clf_nb_r1.predict(X_test1counts)

In [None]:
for platform, category in zip(pd_testwodup1.platform, predicted_nb_r1):
    print('%r => %s' % (platform, category))

In [None]:
#clf_nb_r1.predict_proba(X_testwodup1counts)
#clf_nb_r1.get_params()
#clf_nb_r1.classes_

### DECISION TREE CLASSIFICATION 1

Main Reference:
* http://scikit-learn.org/stable/modules/tree.html


### Model

In [None]:
from sklearn import tree

In [None]:
clf_dtm_r1 = tree.DecisionTreeClassifier()
clf_dt_r1 = clf_dtm_r1.fit(X_primercounts, pd_primerclass.newclass)

### Classification

In [None]:
X_test1counts = count_vect_r1.transform(pd_testwodup1.alltext)
predicted_dt_r1 = clf_dt_r1.predict(X_test1counts)

In [None]:
for platform, category in zip(pd_testwodup1.platform, predicted_dt_r1):
    print('%r => %s' % (platform, category))

## ROUND 2: adding some revised classifications of platforms to primer from test in Round 1 

### New Training Dataset Preparation - Adding Newly Classified Records with Extended Data and Concat to primer

In [None]:
if pathlib.Path(directory+'test_classes_rev_r1.csv').is_file():
    pd_test1class = pandas.read_csv(open(directory+'test_classes_rev_r1.csv', 'r'))
pd_test1class.head()

In [None]:
pd_testwodup1class = pd_testwodup1.loc[pd_testwodup1.platform.isin(pd_test1class['platform'])]
pd_testwodup1class.head(10)

In [None]:
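# assumes test_classes_rev_r1.csv holds only the 'platform' and 'newclass' columns;
# merging it with the extended data attaches the text fields to each revised class
pd_testwodup1class = pandas.merge(pd_test1class, pd_testwodup1, on='platform')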
pd_testwodup1class.head()

In [None]:
pd_combined1class = pandas.concat([pd_primerclass, pd_testwodup1class], ignore_index = True)
pd_combined1class.head(10)

In [None]:
pd_combined1class.shape

In [None]:
#pdcombinedclass.describe

### New Test Dataset Preparation

In [None]:
#taking the same name as above!
pd_testwodup2 = pd_testwodup1.loc[~pd_testwodup1.platform.isin(pd_test1class['platform'])]
pd_testwodup2.head(10)

In [None]:
len(pd_testwodup2)

## New Data Model, with more data 2

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect_r2 = CountVectorizer(ngram_range=(1,2))
X_combined1counts = count_vect_r2.fit_transform(pd_combined1class.alltext)
X_combined1counts.shape

### Correcting some classes...

In [None]:
pd_combined1class.loc[pd_combined1class.newclass.isin(['learn|tutorial|course|training|',
       'learn|tutorial|course|training| tips|example',
       'learn|tutorial|course|training|tips|example']), 'newclass'] = 'learn|tutorial|course|training| tips|example'

In [None]:
pd_combined1class.loc[pd_combined1class.newclass.isin(['(text )?editor|interpreter|repl', '(text)?editor|interpreter|repl']), 'newclass'] = '(text )?editor|interpreter|repl'

## Naive Bayesian Classification 2

### Assigning MNB Classifier Class Reduction to 4 instead of 10

Same as in round one: this part was for further evaluation only.

In [None]:
#pd_combined1class['mnb_class'] = pd_combined1class['newclass']
#pd_combined1class.loc[pd_combined1class.mnb_class.isin(['(text )?editor|interpreter|repl', '---',
#       'cloud|platform|service', 'community|support|people|forum',
#       'design|galler|template|theme',
#       'manual|guide|docs',
#       'on?(-|\\s)?demand|business|compan(y|ies)|enterprise',
#       'searchtools', 'shop|commerce']), 'mnb_class'] = 'other'
#
#pd_combined1class.loc[pd_combined1class.mnb_class == 'other', 'mnb_class'].describe

### Normalized Counts for MNB

Same as in round one: this part was for further evaluation only.

In [None]:
#http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
#normalized_X_train_counts = sklearn.feature_extraction.text.TfidfTransformer(norm='l2').fit_transform(X_combined1counts)

### Model

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf_nb_r2 = MultinomialNB(alpha=0.1).fit(X_combined1counts, pd_combined1class.newclass)

### Classification

In [None]:
X_testwodup2counts = count_vect_r2.transform(pd_testwodup2.alltext)
#normalized_X_nonew_counts = sklearn.feature_extraction.text.TfidfTransformer(norm='l2').fit_transform(X_testwodup2counts)
predicted_nb_r2 = clf_nb_r2.predict( X_testwodup2counts )

In [None]:
for platform, category in zip(pd_testwodup2.platform, predicted_nb_r2):
    print('%r => %s' % (platform, category))

In [None]:
#clf_nb_r2.predict_proba(X_testwodup2counts)
#clf_nb_r2.get_params()
#clf_nb_r2.classes_

## Decision Tree Classification 2

### Model

In [None]:
from sklearn import tree

In [None]:
clf_dtm_r2 = tree.DecisionTreeClassifier()
clf_dt_r2 = clf_dtm_r2.fit(X_combined1counts, pd_combined1class.newclass)

### Classification

In [None]:
X_testwodup2counts = count_vect_r2.transform(pd_testwodup2.alltext)
predicted_dt_r2 = clf_dt_r2.predict(X_testwodup2counts)

In [None]:
for platform, category in zip(pd_testwodup2.platform, predicted_dt_r2):
    print('%r => %s' % (platform, category))

## ROUND 3: Last round, again increasing training dataset with revised classifications at Round 2

### New Training Dataset Preparation - Adding Newly Classified Records with Extended Data and Concat to primer

In [None]:
if pathlib.Path(directory+'test_classes_rev_r2.csv').is_file():
    pd_test2class = pandas.read_csv(open(directory+'test_classes_rev_r2.csv', 'r'))
pd_test2class.head()

In [None]:
pd_testwodup2class = pd_testwodup1.loc[pd_testwodup1.platform.isin(pd_test2class['platform'])]
pd_testwodup2class.head(10)

In [None]:
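# assumes test_classes_rev_r2.csv holds only the 'platform' and 'newclass' columns;
# merging it with the extended data attaches the text fields to each revised class
pd_testwodup2class = pandas.merge(pd_test2class, pd_testwodup1, on='platform')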
pd_testwodup2class.head()

In [None]:
pd_combined2class = pandas.concat([pd_combined1class, pd_testwodup2class], ignore_index = True)
pd_combined2class.head(10)

In [None]:
pd_combined2class.shape

In [None]:
#pdcombinedclass.describe

### New Test Dataset Preparation

In [None]:
#taking the same name as above!
pd_testwodup3 = pd_testwodup2.loc[~pd_testwodup2.platform.isin(pd_test2class['platform'])]
pd_testwodup3.head(10)

In [None]:
len(pd_testwodup3)

## New Data Model, with more data 3

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect_r3 = CountVectorizer(ngram_range=(1,2))
X_combined2counts = count_vect_r3.fit_transform(pd_combined2class.alltext)
X_combined2counts.shape

### Correcting some classes...

In [None]:
pd_combined2class.loc[pd_combined2class.newclass.isin(['learn|tutorial|course|training|',
       'learn|tutorial|course|training| tips|example',
       'learn|tutorial|course|training|tips|example']), 'newclass'] = 'learn|tutorial|course|training| tips|example'

In [None]:
pd_combined2class.loc[pd_combined2class.newclass.isin(['(text )?editor|interpreter|repl', '(text)?editor|interpreter|repl']), 'newclass'] = '(text )?editor|interpreter|repl'

## Naive Bayesian Classification 3

### Assigning MNB Classifier Class Reduction to 4 instead of 10

Same as in round one: this part was for further evaluation only.

In [None]:
#pd_combined2class['mnb_class'] = pd_combined2class['newclass']
#pd_combined2class.loc[pd_combined2class.mnb_class.isin(['(text )?editor|interpreter|repl', '---',
#       'cloud|platform|service', 'community|support|people|forum',
#       'design|galler|template|theme',
#       'manual|guide|docs',
#       'on?(-|\\s)?demand|business|compan(y|ies)|enterprise',
#       'searchtools', 'shop|commerce']), 'mnb_class'] = 'other'
#
#pd_combined2class.loc[pd_combined2class.mnb_class == 'other', 'mnb_class'].describe

### Normalized Counts for MNB

Same as in round one: this part was for further evaluation only.

In [None]:
#http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
#normalized_X_train_counts = sklearn.feature_extraction.text.TfidfTransformer(norm='l2').fit_transform(X_combined2counts)

### Model

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf_nb_r3 = MultinomialNB(alpha=0.1).fit(X_combined2counts, pd_combined2class.newclass)

### Classification

In [None]:
X_testwodup3counts = count_vect_r3.transform(pd_testwodup3.alltext)
#normalized_X_nonew_counts = sklearn.feature_extraction.text.TfidfTransformer(norm='l2').fit_transform(X_testwodup2counts)
predicted_nb_r3 = clf_nb_r3.predict( X_testwodup3counts )

In [None]:
for platform, category in zip(pd_testwodup3.platform, predicted_nb_r3):
    print('%r => %s' % (platform, category))

In [None]:
#clf_nb_r3.predict_proba(X_testwodup3counts)
#clf_nb_r3.get_params()
#clf_nb_r3.classes_

## Decision Tree Classification 3

### Model

In [None]:
from sklearn import tree

In [None]:
clf_dtm_r3 = tree.DecisionTreeClassifier()
clf_dt_r3 = clf_dtm_r3.fit(X_combined2counts, pd_combined2class.newclass)

### Classification

In [None]:
X_testwodup3counts = count_vect_r3.transform(pd_testwodup3.alltext)
predicted_dt_r3 = clf_dt_r3.predict(X_testwodup3counts)

In [None]:
for platform, category in zip(pd_testwodup3.platform, predicted_dt_r3):
    print('%r => %s' % (platform, category))

## SAVING THE CLASSIFIED PLATFORMS 