## Natural Language Analytics - Voulgari Eleni

### Exercise 1: Pre-processing 
#### Question A

Load libraries.

In [1]:
import os
import re
import nltk
import string
from urllib import urlopen  # Python 2 location; Python 3 moved urlopen to urllib.request
from nltk import FreqDist
from bs4 import BeautifulSoup
from collections import Counter
import html5lib
from nltk.corpus import stopwords
from nltk.util import ngrams

Fetch and read the HTML page.

In [2]:
url = "https://en.wikipedia.org/wiki/Artificial_neural_network"
html = urlopen(url).read()

Use the Beautiful Soup library to parse the HTML and pull the data out of it.

In [3]:
pulled_data = BeautifulSoup(html, 'html5lib')
pulled_data

<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en"><head>\n<meta charset="unicode-escape"/>\n<title>Artificial neural network - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Artificial_neural_network","wgTitle":"Artificial neural network","wgCurRevisionId":836490687,"wgRevisionId":836490687,"wgArticleId":21523,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: Explicit use of et al.","CS1 maint: Uses editors parameter","CS1 German-language sources (de)","CS1 maint: Uses authors parameter","Use dmy dates from June 2013","All articles with unsourced statements","Articles with unsourced statements from August 2017","Wikipedia

Remove the tags we do not need from the parsed HTML, namely "script", "mstyle" and "span", together with their content.

The remaining text, which is the part we work on, is extracted with Beautiful Soup's get_text method.

In [4]:
# Remove unneeded tags (script, mstyle, span) together with their content
for tags in pulled_data(['script', 'mstyle', 'span']): 
    tags.decompose()

# Get the rest of the text
pulled_text = pulled_data.get_text()
pulled_text

u'\n\nArtificial neural network - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\t\n\t\t\n\t\t\n\t\t\t\n\t\t\t\n\nArtificial neural network\t\t\t\n\t\t\t\tFrom Wikipedia, the free encyclopedia\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearch\n\t\t\t\t\n\t\t\t\t"Neural network" redirects here. For other uses, see Neural network (disambiguation).\n\n\nMachine learning and\ndata mining\n\n\n\n\n\n\n\nProblems\n\n\n\nClassification\nClustering\nRegression\nAnomaly detection\nAutoML\nAssociation rules\nReinforcement learning\nStructured prediction\nFeature engineering\nFeature learning\nOnline learning\nSemi-supervised learning\nUnsupervised learning\nLearning to rank\nGrammar induction\n\n\n\n\n\n\n\n\n\n\nSupervised learning\n\n\n\n\n\nDecision trees\nEnsembles (Bagging, Boosting, Random forest)\nk-NN\nLinear regression\nNaive Bayes\nNeural networks\nLogistic regression\nPerceptron\nRelevance vector machine (RVM)\nSupport vector machine 

#### First Question - Word Count and Vocabulary of Web page

In [5]:
# Tokenize the text into list of words
tokens = nltk.word_tokenize(pulled_text)

# Convert to lower-case
tokens = [word.lower() for word in tokens]
print "Lower-cased tokens:", tokens    

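# Keep only tokens that contain at least one letter
# (drops pure punctuation and pure numbers; mixed tokens such as '9-bit' are kept)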
nonPunct = re.compile('.*[A-Za-z].*')
raw_words = [tok for tok in tokens if nonPunct.match(tok)]

Lower-cased tokens: [u'artificial', u'neural', u'network', u'-', u'wikipedia', u'artificial', u'neural', u'network', u'from', u'wikipedia', u',', u'the', u'free', u'encyclopedia', u'jump', u'to', u':', u'navigation', u',', u'search', u"''", u'neural', u'network', u"''", u'redirects', u'here', u'.', u'for', u'other', u'uses', u',', u'see', u'neural', u'network', u'(', u'disambiguation', u')', u'.', u'machine', u'learning', u'and', u'data', u'mining', u'problems', u'classification', u'clustering', u'regression', u'anomaly', u'detection', u'automl', u'association', u'rules', u'reinforcement', u'learning', u'structured', u'prediction', u'feature', u'engineering', u'feature', u'learning', u'online', u'learning', u'semi-supervised', u'learning', u'unsupervised', u'learning', u'learning', u'to', u'rank', u'grammar', u'induction', u'supervised', u'learning', u'decision', u'trees', u'ensembles', u'(', u'bagging', u',', u'boosting', u',', u'random', u'forest', u')', u'k-nn', u'linear', u'regress
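
To make the effect of the filter concrete, here is a minimal sketch that applies the same regular expression to a handful of tokens. The sample list is hypothetical and only chosen to cover each case (a few entries, such as `9-bit`, `h.` and `k-nn`, also show up in this notebook's output); it uses only the standard library, so it does not depend on NLTK.

```python
from __future__ import print_function  # keeps print() working on Python 2 as well
import re

# Same pattern as in the cell above: a token passes if it contains
# at least one ASCII letter anywhere.
has_letter = re.compile('.*[A-Za-z].*')

# Hypothetical sample tokens, chosen for illustration only
sample_tokens = ['neural', '-', '2013', '9-bit', 'h.', '(', 'k-nn', "''"]

kept = [tok for tok in sample_tokens if has_letter.match(tok)]
dropped = [tok for tok in sample_tokens if not has_letter.match(tok)]

print("kept:   ", kept)
print("dropped:", dropped)
```

Tokens that mix letters with digits or punctuation survive the filter, which is why entries like `9-bit` and `h.` appear in the vocabulary further down.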

In [6]:
# Count the total number of words in the text including stopwords
total_words = len(raw_words)
print "The total number of words is:", total_words, "\n"

# Use the Counter class from the collections module to produce the word count
word_count = Counter(raw_words)
print "The word count is shown below:\n", word_count

The total number of words is: 10743 

The word count is shown below:
Counter({u'the': 560, u'of': 314, u'a': 296, u'and': 259, u'to': 235, u'in': 183, u'is': 154, u'neural': 150, u'learning': 123, u'that': 103, u'as': 101, u'networks': 101, u'network': 94, u'for': 93, u'by': 86, u'are': 80, u'with': 78, u'can': 65, u'be': 65, u'this': 64, u'from': 59, u'an': 59, u'function': 59, u'on': 58, u'machine': 49, u'or': 49, u'model': 47, u'deep': 47, u'computing': 46, u'input': 45, u'memory': 40, u'data': 40, u'it': 39, u'such': 37, u'layers': 35, u'cost': 35, u'output': 35, u'models': 34, u'recognition': 34, u'artificial': 33, u'used': 32, u'neurons': 30, u'architecture': 26, u'has': 26, u'other': 26, u'using': 25, u'algorithm': 25, u'have': 25, u'layer': 25, u'weights': 24, u'use': 24, u'which': 24, u'each': 23, u'was': 23, u'not': 23, u'processing': 23, u'they': 22, u'training': 22, u'backpropagation': 22, u'hidden': 21, u'one': 21, u'lstm': 21, u'where': 21, u'their': 21, u'unit': 20, u'th

In [7]:
# Use NLTK's FreqDist class to build the vocabulary of unique words of the Web page
vocabulary = nltk.probability.FreqDist(raw_words)

# Print the object, number of elements and the words of the vocabulary
print "The object of vocabulary is:", vocabulary, "\n"
print "The number of unique words is:", len(vocabulary), "\n"

print "The vocabulary is shown below:\n"
for word in vocabulary:
    print word

The object of vocabulary is: <FreqDist with 2700 samples and 10743 outcomes> 

The number of unique words is: 2700 

The vocabulary is shown below:

limited
two-dimensional
hmms
interference
desirable
borgelt
al.cs1
circuitry
entropy
consists
neuro-fuzzy-systeme
poorly
whose
calculate
similarity
exploding
bias-variance
endianness
under
worth
updated
risk
downstream
floating-point
dynamic
activation
rise
amorphous
accelerator
every
updates
namespaces
classifications
9-bit
task-specific
vast
networksclassification
basics
transduce
solution
broyden-fletcher-goldfarb-shanno
vector
ultra-low-voltage
rprop
sleep
markov
cmt
1940s
commented
specially
disciplines
concise
hierarchical-deep
estimates
batch
approximation
second
sarsa
zisc
estimated
machines
even
employ
errors
max-pooling
selected
asic
while
reconstruction
retrieved
human-competitive
new
net
asip
simultaneously
here
reported
protection
represented
wwn-1
wwn-7
digits
property
items
k
changed
pdfprintable
pooling
menu
permit
calculat

university
sufficiently
magnitude
l2
softmax
h.
title=artificial_neural_network
reproducing
mcculloch
determine
map
pitts
related
constitute
frequency
static
variety
measure
our
salesman
special
out
maximizes
dbscan
matrix
multi-stage
integrates
dsns
performs
maximized
adaptive
induction
integrated
red
interactions
hyperparameters
ecl
something-for-nothing
greedy
powerpc
approaches
tied
philip
backwards
organic
determining
could
feasible
skin-surface
explored
time-varying
quadratically
facilitate
pada
isbi
isbn
matthias
powerful
15-bit
echo
improvements
hertz
representations
reached
quality
long
management
hyperbolic
unknown
system
relations
their
memristor
automl
lapa
programmable
final
non-parametric
paved
originating
accompany
fine-tuning
shallow
wiley
simulate
fuzzy
oisc
dynamical
observed
colleagues
holland
pointers
depends
limiting
defense
pagespermanent
graupe
counter
robot
have
need
kelley
vieweg
viewed
documents
marginalizing
agency
able
accelerated
mechanism
contact
mix
thrus

#### Second Question - Sentences

In [8]:
# Use the sent_tokenize function to split the text into sentences and count them
sentences = nltk.tokenize.sent_tokenize(pulled_text)
print "The number of sentences is:", len(sentences)

The number of sentences is: 565


#### Third Question - Lexical Diversity

Lexical diversity refers to the variety of words used in a text. It is commonly measured as the type-token ratio: the number of unique words (types) divided by the total number of words (tokens).

In [9]:
# Find the lexical diversity by dividing the number of unique words by the number of total words
diversity = len(vocabulary) / float(total_words)
print "The lexical diversity is:", diversity

The lexical diversity is: 0.251326445127
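
As a small worked example of the ratio, the sketch below computes it for a toy word list and then repeats the division with the two figures reported in this notebook (2700 unique words, 10743 words in total). The function name `lexical_diversity` and the toy sentence are illustrative assumptions, not part of the original analysis.

```python
from __future__ import division, print_function  # true division on Python 2 too


def lexical_diversity(words):
    """Type-token ratio: number of unique words divided by number of words."""
    return len(set(words)) / len(words)


# Hypothetical toy text: 8 words, 5 of them distinct
toy_words = ['the', 'network', 'learns', 'the', 'weights', 'of', 'the', 'network']
print(lexical_diversity(toy_words))

# The same division with the counts reported above
print(2700 / 10743)
```

Because the ratio tends to fall as a text gets longer, it is most meaningful when comparing texts of similar length.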


#### Fourth Question - 5 Most Common Lexical Categories (parts of speech)

To find the 5 most common lexical categories we perform part-of-speech tagging, which is the process of marking each word in a text (corpus) with the part of speech it corresponds to.

In [10]:
# Use the pos_tag function to perform POS tagging
part_of_speech_tag = nltk.pos_tag(raw_words)

# Make a list of the part-of-speech tags
pos_tags = [pos for (word, pos) in part_of_speech_tag]

# Find and print the five most frequent part-of-speech tags in the above list
five_common = nltk.FreqDist(pos_tags).most_common(5)
print five_common

[('NN', 2759), ('JJ', 1632), ('IN', 1202), ('DT', 1094), ('NNS', 931)]
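
The tags follow the Penn Treebank convention: NN is a singular or mass noun, JJ an adjective, IN a preposition or subordinating conjunction, DT a determiner and NNS a plural noun. The counting step itself can be sketched without NLTK; the `tagged` list below is hand-written for illustration and merely stands in for what a tagger would return.

```python
from __future__ import print_function
from collections import Counter

# Hand-written (word, tag) pairs, a hypothetical stand-in for tagger output
tagged = [('the', 'DT'), ('neural', 'JJ'), ('network', 'NN'),
          ('learns', 'VBZ'), ('weights', 'NNS'), ('from', 'IN'),
          ('the', 'DT'), ('training', 'NN'), ('data', 'NN')]

# Keep only the tags and count how often each one occurs
tag_counts = Counter(tag for _, tag in tagged)

# The two most frequent tags, as (tag, count) pairs
print(tag_counts.most_common(2))
```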


#### Fifth Question - 10 Most Common Unigrams - 10 Most Common Bigrams

An n-gram is a contiguous sequence of n items from a given sample of text or speech. When the items are words, a unigram is a single word and a bigram is a sequence of two adjacent words.

In [11]:
# Remove English stop words, using NLTK's stopwords.words('english') list
# (built once as a set so that each membership test is fast)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in raw_words if word not in stop_words]

# Use the ngrams function and the Counter class to build the unigrams and bigrams, then find the 10 most common
unigrams = ngrams(filtered_words,1)
ten_unigrams = Counter(unigrams).most_common(10)
print "The 10 unigrams are:", ten_unigrams
print "\n"

bigrams = ngrams(filtered_words,2)
ten_bigrams = Counter(bigrams).most_common(10)
print "The 10 bigrams are:", ten_bigrams

The 10 unigrams are: [((u'neural',), 150), ((u'learning',), 123), ((u'networks',), 101), ((u'network',), 94), ((u'function',), 59), ((u'machine',), 49), ((u'model',), 47), ((u'deep',), 47), ((u'computing',), 46), ((u'input',), 45)]


The 10 bigrams are: [((u'neural', u'networks'), 68), ((u'neural', u'network'), 50), ((u'artificial', u'neural'), 19), ((u'machine', u'learning'), 17), ((u'deep', u'learning'), 15), ((u'cost', u'function'), 15), ((u'isbn', u'oclc'), 13), ((u'pattern', u'recognition'), 11), ((u'activation', u'function'), 9), ((u'main', u'article'), 9)]
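
To illustrate what an n-gram window looks like, here is a minimal standard-library sketch. The helper `sliding_ngrams` and the toy word list are assumptions made for this example; the helper is a simple stand-in for the idea, not NLTK's own implementation.

```python
from __future__ import print_function
from collections import Counter


def sliding_ngrams(words, n):
    """Return every run of n consecutive items as a tuple."""
    # zip over n copies of the list, each shifted by one more position
    return list(zip(*[words[i:] for i in range(n)]))


# Hypothetical toy text
toy_words = ['deep', 'neural', 'network', 'and', 'deep', 'neural', 'models']

unigrams_toy = sliding_ngrams(toy_words, 1)
bigrams_toy = sliding_ngrams(toy_words, 2)

print(unigrams_toy)   # one-element tuples, one per word
print(bigrams_toy)    # pairs of adjacent words
print(Counter(bigrams_toy).most_common(1))
```

Unigrams come out as one-element tuples, which is why the output above shows entries of the form `(u'neural',)`.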


If we had stemmed the words, that is, reduced inflected (or sometimes derived) words to their stem, base or root form, the words "networks" and "network" would be counted together as a single stem.
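
A rough sketch of that effect, using counts copied from the word-count output earlier in this notebook. The function `naive_stem` is a deliberately crude, hypothetical stand-in that only strips a trailing plural "s"; a real analysis would use a proper stemmer such as NLTK's `PorterStemmer`, which applies many more rules.

```python
from __future__ import print_function
from collections import Counter


def naive_stem(word):
    """Crude stand-in for a stemmer: strip a single trailing plural 's'."""
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word


# Counts taken from the word-count output of this notebook
reported = {'networks': 101, 'network': 94, 'models': 34, 'model': 47}

# Add up the counts of all words that share the same stem
stem_counts = Counter()
for word, count in reported.items():
    stem_counts[naive_stem(word)] += count

print(stem_counts.most_common())
```

With the singular and plural forms merged, "network" would overtake "neural" (150) as the most frequent non-stop word.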

#### Sixth Question - Number of Nouns in the Page

In [12]:
# Make a list of the words tagged as nouns (NN, NNS, NNP, NNPS) during part-of-speech tagging
nouns = [word for (word, pos) in part_of_speech_tag if pos in ('NN', 'NNP', 'NNS', 'NNPS')]
print "The number of nouns is:", len(nouns)

The number of nouns is: 3700
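
The filtering step can be sketched on a small hand-tagged list, followed by a quick cross-check against the tag counts reported in the fourth question. The `tagged` pairs are hypothetical; the three numbers in the cross-check are the ones printed in this notebook.

```python
from __future__ import print_function

# The four Penn Treebank noun tags used in the cell above
NOUN_TAGS = ('NN', 'NNS', 'NNP', 'NNPS')

# Hand-written (word, tag) pairs, a hypothetical stand-in for tagger output
tagged = [('the', 'DT'), ('network', 'NN'), ('learns', 'VBZ'),
          ('weights', 'NNS'), ('wikipedia', 'NNP')]

toy_nouns = [word for word, tag in tagged if tag in NOUN_TAGS]
print(toy_nouns)

# Cross-check: total nouns minus the reported NN and NNS counts
# leaves the number of words tagged NNP or NNPS.
proper_nouns = 3700 - (2759 + 931)
print(proper_nouns)
```

Only a handful of words end up tagged as proper nouns. A likely reason is that the tokens were lower-cased before tagging, and taggers usually rely on capitalization to recognise proper nouns.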
