## First NLP task
The data we are dealing with is raw text. Our overall problem is to classify these text files into one of 20 classes. We will use Naive Bayes for the classification, but first we need to convert the raw text into usable observations, i.e. convert each document into a feature vector. To do this we will rely on the NLTK library. The workflow is taken from: https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e . For more in-depth guidance: https://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk . And for processing the text files: https://www.nltk.org/book/ch03.html#fig-pipeline1 .
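
To make "feature vector" concrete, here is a minimal sketch of one common choice, a bag-of-words presence vector. The function name `document_features` and the tiny vocabulary are hypothetical and chosen only for illustration; the notebook has not settled on a feature representation at this point.

```python
def document_features(tokens, vocabulary):
    """Map a tokenised document to a dict of word-presence features.

    `vocabulary` is an assumed, pre-chosen list of words; it is not
    something the notebook has built yet.
    """
    present = {tok.lower() for tok in tokens}
    return {f"contains({word})": (word in present) for word in vocabulary}


# Hypothetical toy data, for illustration only
vocab = ["library", "congress", "fish"]
tokens = ["Library", "of", "Congress", "Catalog", "Card", "Number"]
features = document_features(tokens, vocab)
```

A dict of this shape, paired with a class label, is the kind of input that `nltk.NaiveBayesClassifier.train` accepts.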

We will first look at some individual documents to see what sort of preprocessing is required. In the general ML workflow this corresponds roughly to a combination of the data analysis and feature extraction stages.

In [1]:
import nltk, re, pprint

### Document 1
Class: alt.atheism, Id: 49960


In [2]:
d1_path = '/home/antoni/Documents/Sample Data/NewsGroups20/20NewsGroups(2)/49960'
with open(d1_path) as f:  # the context manager closes the file after reading
    d1 = f.read()

We had some problems opening the text files because each document ends with a strange /FF line. It is not yet clear whether this is significant.
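
One possible cause, and this is only an assumption, is that the files contain bytes such as 0xFF that are not valid UTF-8, so the default decoding fails. A minimal sketch of a more forgiving reader, using a hypothetical helper `read_document`:

```python
def read_document(path, encoding="latin-1"):
    """Read a text file without failing on odd bytes.

    Assumption: the documents are single-byte encoded. latin-1 maps every
    byte value to a character, so decoding cannot raise UnicodeDecodeError.
    """
    with open(path, encoding=encoding) as handle:
        return handle.read()
```

If the files turn out to be in a different encoding, some characters will be decoded incorrectly, but the read itself will still succeed.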

### Step 1: Sentence Segmentation

In [3]:
sent_tokenize_list = nltk.sent_tokenize(d1)

In [4]:
sent_tokenize_list[10]


'For net people who go to Lynn directly, the\nprice is $4.95 per fish.'

### Step 2: Word Tokenization

In [5]:
words = nltk.word_tokenize(sent_tokenize_list[100])
words

['Library', 'of', 'Congress', 'Catalog', 'Card', 'Number', '89-64079', '.']

### Step 3: Predicting Parts of Speech for Each Token

In [6]:
nltk.pos_tag(words)

[('Library', 'NNP'),
 ('of', 'IN'),
 ('Congress', 'NNP'),
 ('Catalog', 'NNP'),
 ('Card', 'NNP'),
 ('Number', 'NNP'),
 ('89-64079', 'CD'),
 ('.', '.')]

### Step 4: Text Lemmatization
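For now we omit proper lemmatization, because it is not yet clear whether it is best to lemmatize before or after POS tagging. As a stand-in, the cell below applies the Porter stemmer, which is stemming rather than lemmatization: it truncates words to a stem (for example, 'Library' becomes 'librari') instead of mapping them to dictionary forms.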

In [7]:
ps = nltk.PorterStemmer()  # stemming, a cruder stand-in for lemmatization
l = [ps.stem(i) for i in words]
l

['librari', 'of', 'congress', 'catalog', 'card', 'number', '89-64079', '.']
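
On the ordering question: NLTK's `WordNetLemmatizer.lemmatize(word, pos)` accepts a part-of-speech hint and treats the word as a noun by default, which is an argument for tagging first and lemmatizing afterwards. The helper below is a hypothetical sketch (the name `penn_to_wordnet` is not part of NLTK) that converts the Penn Treebank tags returned by `nltk.pos_tag` into the single-letter codes WordNet uses.

```python
def penn_to_wordnet(tag):
    """Convert a Penn Treebank tag to a WordNet POS letter.

    Falls back to 'n' (noun) for tags with no WordNet counterpart.
    """
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and everything else


# Tagged tokens in the format produced by nltk.pos_tag
tagged = [("Library", "NNP"), ("of", "IN"), ("Congress", "NNP")]
wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
```

Each pair could then be passed to `nltk.stem.WordNetLemmatizer().lemmatize(word, pos)`, which needs the WordNet data to be downloaded first.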

In [8]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

### Step 5: Identifying Stop Words

In [9]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [10]:
# keep only the stems that are not English stop words
s = [i for i in l if i not in stop_words]
s

['librari', 'congress', 'catalog', 'card', 'number', '89-64079', '.']

### Step 6: Dependency Parsing
Dependency parsing is presumably quite computationally expensive, and it may not be necessary for our classification problem, so I will omit this step for the time being. Some other steps are probably unnecessary for our classification problem as well: noun phrase tagging, named entity recognition, and coreference resolution.

## Task 1
The plan is to tokenize the words and remove stop words, numbers, and other symbols, then see whether this gives us good classification accuracy. We will be following https://pythonprogramming.net/text-classification-nltk-tutorial/ .
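
The filtering described above could be sketched as follows. The helper name `clean_tokens` is hypothetical, and the stop-word set is passed in as a parameter so the example does not depend on NLTK's corpus download; in the notebook it would be the `stop_words` set built in Step 5.

```python
def clean_tokens(tokens, stop_words):
    """Lowercase tokens and keep only alphabetic, non-stop-word ones."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        # str.isalpha() rejects numbers, punctuation and mixed tokens
        if tok.isalpha() and tok not in stop_words:
            cleaned.append(tok)
    return cleaned


# Tokens from Step 2, with a tiny stand-in stop-word set
sample = ['Library', 'of', 'Congress', 'Catalog', 'Card', 'Number', '89-64079', '.']
kept = clean_tokens(sample, {'of', 'the', 'a'})
```

Note that `isalpha()` is a blunt filter: it also drops hyphenated words and contraction fragments such as "n't", which may or may not be desirable for this task.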

### We need to create our own corpus
https://www.nltk.org/book/ch02.html


In [11]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/home/antoni/Documents/Sample Data/NewsGroups20/20_newsgroups/alt.atheism/' 
wordlists = PlaintextCorpusReader(corpus_root, '.*') 

In [12]:
wordlists.words('51124')

['Newsgroups', ':', 'alt', '.', 'atheism', 'Path', ':', ...]

In [13]:
wordlists.fileids()

['49960',
 '51060',
 '51119',
 '51120',
 '51121',
 '51122',
 '51123',
 '51124',
 '51125',
 '51126',
 '51127',
 '51128',
 '51129',
 '51130',
 '51131',
 '51132',
 '51133',
 '51134',
 '51135',
 '51136',
 '51137',
 '51138',
 '51139',
 '51140',
 '51141',
 '51142',
 '51143',
 '51144',
 '51145',
 '51146',
 '51147',
 '51148',
 '51149',
 '51150',
 '51151',
 '51152',
 '51153',
 '51154',
 '51155',
 '51156',
 '51157',
 '51158',
 '51159',
 '51160',
 '51161',
 '51162',
 '51163',
 '51164',
 '51165',
 '51166',
 '51167',
 '51168',
 '51169',
 '51170',
 '51171',
 '51172',
 '51173',
 '51174',
 '51175',
 '51176',
 '51177',
 '51178',
 '51179',
 '51180',
 '51181',
 '51182',
 '51183',
 '51184',
 '51185',
 '51186',
 '51187',
 '51188',
 '51189',
 '51190',
 '51191',
 '51192',
 '51193',
 '51194',
 '51195',
 '51196',
 '51197',
 '51198',
 '51199',
 '51200',
 '51201',
 '51202',
 '51203',
 '51204',
 '51205',
 '51206',
 '51207',
 '51208',
 '51209',
 '51210',
 '51211',
 '51212',
 '51213',
 '51214',
 '51215',
 '51216',


In [14]:
documents = []
category = 'alt.atheism'  # wordlists only covers this one newsgroup folder
for fileid in wordlists.fileids():
    # pair each document's word list with its class label
    documents.append((list(wordlists.words(fileid)), category))