# Assignment 2 - part-of-speech prediction from limited context

In this assignment, you will train classifiers that attempt, within a window of five words, to make a binary prediction about whether the third word belongs to a given part of speech (noun, verb, adjective, adverb), but using very limited information -- that is, the last two letters of the first, second, fourth, and fifth word of the sequence, and no information whatsoever directly from the third word itself.  You will strip out all punctuation (using the NLTK `WordPunctTokenizer`), lowercase, and remove stop words (using the NLTK English stop words list).

In other words, you will predict over samples that have two classes, P and not-P, where P is the selected part of speech to classify.  For example, from the sentence, "The quick brown fox jumped over the lazy dog.", we can select the following 5-word windows without stop words, "brown fox jumped lazy dog" and "quick brown fox jumped lazy".  If we select verbs as the part-of-speech we are classifying over, we get the instances <(wn,ox,zy,og),1>, since "jumped" is a verb, but <(ck,wn,ed,zy),0> because "fox" in that context is not.

This means that you will need to take into account the position of the last-two-letter feature:  "zy" as the fourth word's last two letters is different from "zy" as the fifth word's last two letters.  They are two features, say, `zy_4` and `zy_5`.
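As a minimal sketch of the instance layout described above: the helper name `window_to_instance`, the `ending_position` naming scheme, and the hand-written tags are illustrative assumptions, not a required interface.

```python
def window_to_instance(words, tags, target_prefix):
    """Turn one five-word window into (features, label).

    words: five lowercased tokens; tags: their POS tags.
    target_prefix: tag prefix of the class P, e.g. "VB" for verbs
    (assuming Penn Treebank style tags, as produced by nltk.pos_tag).
    """
    if len(words) != 5 or len(tags) != 5:
        raise ValueError("a window must contain exactly five words")
    # Words 1, 2, 4 and 5 contribute features; word 3 only supplies the label.
    features = tuple(f"{words[i][-2:]}_{i + 1}" for i in (0, 1, 3, 4))
    label = int(tags[2].startswith(target_prefix))
    return features, label


# Hand-written tags for the example sentence (an assumption, not tagger output).
window_a = ["brown", "fox", "jumped", "lazy", "dog"]
tags_a = ["JJ", "NN", "VBD", "JJ", "NN"]
print(window_to_instance(window_a, tags_a, "VB"))
```

With this naming, "zy" in the fourth position becomes `zy_4` and "zy" in the fifth position becomes `zy_5`, so the two never share a column.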

This will likely not actually work.  But it might!

You will create training and testing samples according to this procedure, and you will build a data structure that can be fed to a support vector machine (SVM) classifier.  You will train the classifier on the training data and evaluate it on the testing data. 

The work will be done in a .py module file in the same folder as this notebook.  **No modifications to this notebook will be graded.** To test your code, we will run your module using this notebook or a modified version of it that you will not see.

The file you must create and add to the GitHub repo is `mycode.py`, which will be imported here.  You can create your own notebooks or scripts to test it.  You can add any number of your own helper functions and also add optional parameters to any of the Python functions mentioned here. You should also create a Markdown file, `notes.md`, to keep any **concise** notes and remarks about the assignment.  The code must run on mltgpu.

**This assignment is due Monday, 2022 March 7, at 23:59. There are 33 points and 5 bonus points.**

In [1]:
import mycode as mc

In [2]:
mc.test()

ca marche!


## Part 0 - preparation (2 points)

Fork this repository and create and add `mycode.py` and `notes.md`. 

## Part 1 - obtaining the text (3 points)

You will randomly select the given number of lines from the gzipped file we give you (so you will have to figure out how to access gzipped text files).  Explain how you implemented the random selection in `notes.md`. When we run it, it should give a new sample every time.
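One possible approach, sketched under assumptions: the function name `sample_lines_sketch` is hypothetical, the file is assumed to be UTF-8 text, and the whole file is assumed to fit in memory. The output shown in this notebook contains `bytes` lines, which suggests the file was opened in binary mode there; text mode is used here only to keep the sketch simple.

```python
import gzip
import random


def sample_lines_sketch(path, lines):
    """Return `lines` randomly chosen lines from a gzipped text file."""
    # gzip.open in "rt" mode decompresses and decodes to str on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        all_lines = handle.readlines()
    if lines > len(all_lines):
        raise ValueError("requested more lines than the file contains")
    # random.sample draws without replacement; because the generator is
    # not seeded, every call gives a different sample.
    return random.sample(all_lines, lines)
```

For a file too large to hold in memory, a single-pass technique such as reservoir sampling would be an alternative.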

In [3]:

sampled_lines = mc.sample_lines("/scratch/UN-english.txt.gz", lines=100000)

print(len(sampled_lines))
sampled_lines[40000:40010]

100000


[b'(Thirty-fourth session, Geneva, 18-20 September 2002)\n',
 b'Mr. Walid Al-Hadid\n',
 b'It was also considered unacceptable for public funds to be used to compensate for loss that should be allocated to the operator.\n',
 b'INLAND TRANSPORT COMMITTEE\n',
 b'Mr. J. F. Kissack\n',
 b'Those issues and the needed response were highlighted in the Almaty Programme of Action.\n',
 b'We are dealing with the agenda item entitled "Oceans and the law of the sea" on the very same day we are celebrating the tenth anniversary of the entry into force of the United Nations Convention on the Law of the Sea, a text of historic impact. The Convention has made an indisputable contribution to the codification of international law of the sea and is an important milestone in the establishment of a global legal framework for governance of the various marine areas and their living and non-living resources.\n',
 b'The latter would continue to be reviewed by UNMOVIC and IAEA to ensure that items were not inclu

## Part 2 - creating the samples (7 points)

From the sampled lines, you will then randomly create the five-word samples.

You will tokenize the sentences and apply POS-tagging to them -- you need to do this before you create the samples, since POS-tagging needs context. You will then remove stop words and punctuation and lowercase the remainder.  Next, you will randomly, over the entire set of sentences, choose samples of five words in sequence, up to a certain limit.  You will find the last two characters of the first, second, fourth, and fifth words, and create the type of structure specified in the introduction to this assignment for each sample. The exact representation is up to you.
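The filtering and window selection steps could look roughly like the sketch below. The helper names `filter_tagged` and `random_window`, the toy stop list, and the hand-written tags are assumptions made for illustration; in the assignment, the tags would come from `nltk.pos_tag` and the stop words from the NLTK English list.

```python
import random


def filter_tagged(tagged, stop_words):
    """Lowercase tagged tokens and drop stop words and non-alphabetic tokens."""
    kept = []
    for word, tag in tagged:
        lowered = word.lower()
        if lowered.isalpha() and lowered not in stop_words:
            kept.append((lowered, tag))
    return kept


def random_window(tokens, size=5, rng=random):
    """Pick one random run of `size` consecutive tokens, or None if too short."""
    if len(tokens) < size:
        return None
    start = rng.randint(0, len(tokens) - size)  # randint includes both endpoints
    return tokens[start:start + size]


# Hand-written tags and a toy stop list, for illustration only.
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]
kept = filter_tagged(tagged, stop_words={"the", "over"})
print([word for word, _ in kept])
print(random_window(kept))
```

Tagging happens before filtering, so each surviving word keeps the tag it received in its full sentence context.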

In [115]:
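import nltk
import pandas as pd
from nltk.tokenize import WordPunctTokenizer
def list_df_POStag(s):
    #tokenize and POS-tag each line; returns one DataFrame (word, POS) per line
    tokenizer=WordPunctTokenizer()
    list_of_df=[]
    for sent in s:
        #the sampled lines are bytes: decode them instead of stripping the "b'...'" repr
        text=sent.decode("utf-8") if isinstance(sent,bytes) else str(sent)
        list_of_df.append(pd.DataFrame(nltk.pos_tag(tokenizer.tokenize(text.strip())),columns=['word','POS']))
        
    return list_of_df  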

In [116]:
list_of_df=list_df_POStag(sampled_lines[40000:40010])

In [117]:
list_of_df

[             word  POS
 0              In   IN
 1            this   DT
 2         context   NN
 3               ,    ,
 4             the   DT
 5       Committee  NNP
 6             has  VBZ
 7        likewise   RB
 8         focused  VBN
 9              on   IN
 10            the   DT
 11  participation   NN
 12             of   IN
 13          rural   JJ
 14          women  NNS
 15             in   IN
 16          local   JJ
 17            and   CC
 18       national   JJ
 19         public   JJ
 20       decision   NN
 21              -    :
 22         making   NN
 23              ,    ,
 24             as   IN
 25              a   DT
 26          means   NN
 27             of   IN
 28    empowerment   NN
 29            and   CC
 30             of   IN
 31      enhancing  VBG
 32         access   NN
 33             to   TO
 34     productive   JJ
 35      resources  NNS,
             word  POS
 0            The   DT
 1           list   NN
 2             of   IN
 3     additional  

In [103]:
dg=list_df_POStag(["that happens that the word  67 happens is said twice"])

In [49]:
dg[0]

Unnamed: 0,word,POS
0,that,DT
1,happens,VBZ
2,that,IN
3,the,DT
4,word,NN
5,67,CD
6,happens,NNS
7,is,VBZ
8,said,VBD
9,twice,RB


In [46]:
dg[0][dg[0]['word']=='that']

Unnamed: 0,word,POS
0,that,DT
2,that,IN


In [53]:
dg[0]['word'].str.lower()

0       that
1    happens
2       that
3        the
4       word
5         67
6    happens
7         is
8       said
9      twice
Name: word, dtype: object

In [59]:
dg[0].drop(dg[0][~dg[0]['word'].str.isalpha()].index)

Unnamed: 0,word,POS
0,that,DT
1,happens,VBZ
2,that,IN
3,the,DT
4,word,NN
6,happens,NNS
7,is,VBZ
8,said,VBD
9,twice,RB


In [75]:
stop_words = stopwords.words('english')
pattern = '|'.join(stop_words)

In [96]:
dg[0]['word'].str.contains(pattern)

0     True
1    False
2     True
3    False
4     True
5    False
6    False
7    False
8    False
9    False
Name: word, dtype: bool

In [22]:
import nltk
from nltk.tokenize import WordPunctTokenizer
import pandas as pd
def list_dic_POStag1(s):
    tokenizer=WordPunctTokenizer()
    list_of_dic=[]
    for sent in s:
        #the sampled lines are bytes: decode them instead of stripping the "b'...'" repr
        text=sent.decode("utf-8") if isinstance(sent,bytes) else str(sent)
        list_of_dic.append(pd.DataFrame(nltk.pos_tag(tokenizer.tokenize(text.strip())),columns=['word','POS']))
        
    return list_of_dic  

In [99]:
#pattern='|'.join(['that','word'])
dg[0].drop(dg[0][dg[0]['word'].str.contains(pattern)].index)

Unnamed: 0,word,POS
1,happens,VBZ
3,the,DT
5,67,CD
6,happens,NNS
7,is,VBZ
8,said,VBD
9,twice,RB


In [23]:
list_of_dic=list_dic_POStag1(sampled_lines[40000:40010])
list_of_dic

[           word  POS
 0             d   NN
 1             )    )
 2  Proportional  NNP
 3         means  NNS,
          word  POS
 0       Voice  NNP
 1   reporting  VBG
 2         via   IN
 3         VHF  NNP
 4           (    (
 5          ch   NN
 6           .    .
 7           5   CD
 8           .    .
 9           1   CD
 10          .    .
 11          2   CD
 12          (    (
 13          7   CD
 14         ))   NN,
           word   POS
 0    Therefore    RB
 1            ,     ,
 2          the    DT
 3       author    NN
 4       argues   VBZ
 5         that    RB
 6            ,     ,
 7           as    IN
 8           he   PRP
 9          had   VBD
 10          at    IN
 11       least   JJS
 12          up    RB
 13       until    IN
 14           5    CD
 15     January   NNP
 16        1999    CD
 17           (     (
 18         two    CD
 19      months   NNS
 20        from    IN
 21         the    DT
 22     passing    NN
 23          of    IN
 24         the   

In [14]:
list_of_dic[1].keys()

dict_keys(['Voice', 'reporting', 'via', 'VHF', '(', 'ch', '.', '5', '1', '2', '7', '))'])

In [104]:
stop_words = stopwords.words('english')
pattern = '|'.join(stop_words)
dg[0]['word']=dg[0]['word'].str.lower() 
dg[0]=dg[0][dg[0]['word'].str.contains(pattern)]

In [106]:
pattern

"i|me|my|myself|we|our|ours|ourselves|you|you're|you've|you'll|you'd|your|yours|yourself|yourselves|he|him|his|himself|she|she's|her|hers|herself|it|it's|its|itself|they|them|their|theirs|themselves|what|which|who|whom|this|that|that'll|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|s|t|can|will|just|don|don't|should|should've|now|d|ll|m|o|re|ve|y|ain|aren|aren't|couldn|couldn't|didn|didn't|doesn|doesn't|hadn|hadn't|hasn|hasn't|haven|haven't|isn|isn't|ma|mightn|mightn't|mustn|mustn't|needn|needn't|shan|shan't|shouldn|shouldn't|wasn|wasn't|weren|weren't|won|won't|wouldn|wouldn't"

In [105]:
dg[0]

Unnamed: 0,word,POS
0,that,DT
1,happens,VBZ
2,that,IN
3,the,DT
4,word,NN
6,happens,NNS
7,is,VBZ
8,said,VBD
9,twice,RB


In [107]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer


def processed_sentences(list_of_df):
    result=[]
    stop_words = stopwords.words('english')
    
    for df in list_of_df:
        df['word']=df['word'].str.lower() 
        df=df.drop(df[df['word'].isin(stop_words)].index)
        df=df.drop(df[~df['word'].str.isalpha()].index)
        df.reset_index(drop=True, inplace=True)       
        result.append(df)
    return result

In [108]:
processed_sentences(list_of_df)

[           word  POS
 0  proportional  NNP
 1         means  NNS,
         word  POS
 0      voice  NNP
 1  reporting  VBG
 2        via   IN
 3        vhf  NNP
 4         ch   NN,
           word  POS
 0    therefore   RB
 1       author   NN
 2       argues  VBZ
 3        least  JJS
 4      january  NNP
 5          two   CD
 6       months  NNS
 7      passing   NN
 8   resolution   NN
 9         file   VB
 10   complaint   NN
 11    actually   RB
 12    december  NNP
 13        well   RB
 14      within   IN
 15  limitation   NN
 16      period   NN,
             word  POS
 0           case   NN
 1        vehicle   NN
 2       equipped  VBN
 3       electric   JJ
 4   regenerative   NN
 5        braking   NN
 6         system   NN
 7   requirements  NNS
 8         depend  VBP
 9       category   NN
 10        system   NN,
               word  POS
 0            staff   NN
 1           member   NN
 2   misrepresented  VBN
 3        falsified  VBN
 4    certification   NN
 5        ed

In [89]:
list_of_df

[           word  POS
 0             d   NN
 1             )    )
 2  proportional  NNP
 3         means  NNS,
          word  POS
 0       voice  NNP
 1   reporting  VBG
 2         via   IN
 3         vhf  NNP
 4           (    (
 5          ch   NN
 6           .    .
 7           5   CD
 8           .    .
 9           1   CD
 10          .    .
 11          2   CD
 12          (    (
 13          7   CD
 14         ))   NN,
           word   POS
 0    therefore    RB
 1            ,     ,
 2          the    DT
 3       author    NN
 4       argues   VBZ
 5         that    RB
 6            ,     ,
 7           as    IN
 8           he   PRP
 9          had   VBD
 10          at    IN
 11       least   JJS
 12          up    RB
 13       until    IN
 14           5    CD
 15     january   NNP
 16        1999    CD
 17           (     (
 18         two    CD
 19      months   NNS
 20        from    IN
 21         the    DT
 22     passing    NN
 23          of    IN
 24         the   

In [124]:
mc?

In [6]:
processed_sentences = mc.process_sentences(sampled_lines)
processed_sentences[40000:40010]

[        word  POS
 0     thirty  NNP
 1     fourth   JJ
 2    session   NN
 3     geneva  NNP
 4  september  NNP,
     word  POS
 0     mr  NNP
 1  walid  NNP
 2     al  NNP
 3  hadid   NN,
            word  POS
 0          also   RB
 1    considered  VBN
 2  unacceptable   JJ
 3        public   JJ
 4         funds  NNS
 5          used  VBN
 6    compensate   VB
 7          loss   NN
 8     allocated  VBN
 9      operator   NN,
         word  POS
 0     inland  NNP
 1  transport  NNP
 2  committee  NNP,
       word  POS
 0       mr  NNP
 1        j  NNP
 2        f  NNP
 3  kissack   VB,
           word  POS
 0       issues  NNS
 1       needed  VBN
 2     response   NN
 3  highlighted  VBN
 4       almaty  NNP
 5    programme  NNP
 6       action  NNP,
              word   POS
 0         dealing   VBG
 1          agenda    NN
 2            item    NN
 3        entitled   VBD
 4          oceans  NNPS
 5             law    NN
 6             sea    NN
 7             day    NN
 8     ce

In [8]:
processed_sentences[40009].shape

(3, 2)

In [6]:
len(processed_sentences)

100000

In [6]:
#first drop the sentences with fewer than 5 words:
long_enough_sentences=[df for df in processed_sentences if df.shape[0]>=5]
    

In [7]:
len(long_enough_sentences)

76704

In [9]:
#then we randomly pick 50000 of the remaining sentences
samples=50000
import random
if samples>len(long_enough_sentences):
    raise Exception("pick a smaller number of samples or a bigger list of processed_sentences")
all_sentences=random.sample(long_enough_sentences, samples)

In [10]:
len(all_sentences)

50000

In [12]:
#for each df in all_sentences, pick a random 5 elements window
all_samp=[]
for df in all_sentences:
    n=random.randint(0,df.shape[0]-5)
    dg=df.iloc[n:n+5,:]
    all_samp.append(dg)

In [10]:
import random
def create_samples(processed_sentences, samples):
    #first drop the sentences with fewer than 5 words:
    long_enough_sentences=[df for df in processed_sentences if df.shape[0]>=5]
    #then randomly pick `samples` of the remaining sentences
    if samples>len(long_enough_sentences):
        raise ValueError("pick a smaller number of samples or a bigger list of processed_sentences")
    all_sentences=random.sample(long_enough_sentences, samples)
    #for each df in all_sentences, pick a random window of 5 consecutive rows
    all_samp=[]
    for df in all_sentences:
        n=random.randint(0,df.shape[0]-5) #randint includes both endpoints
        dg=df.iloc[n:n+5,:]
        all_samp.append(dg)
    
    return all_samp

In [11]:
all_samp=create_samples(processed_sentences,samples=50000)

In [13]:
all_samples=all_samp

In [14]:
len(all_samples)

50000

In [7]:
all_samples = mc.create_samples(processed_sentences, samples=50000)

print(all_samples[25000:25010])

[              word  POS
12        republic  NNP
13       lithuania  NNP
14        prohibit   VB
15  discrimination   NN
16          direct   JJ,       word POS
4  project  NN
5  manager  NN
6    would  MD
7  oversee  VB
8    pilot  NN,         word  POS
0      price   NN
1   quantity   NN
2    indices  NNS
3  transport   NN
4   services  NNS,              word   POS
15  international    JJ
16          organ    NN
17         united   NNP
18        nations  NNPS
19         unique    JJ,             word  POS
3      indicates  VBZ
4  participation   NN
5        parents  NNS
6       children  NNS
7         always   RB,           word  POS
3  testimonies  NNS
4      abusive   JJ
5    treatment   NN
6        girls  NNS
7        women  NNS,              word  POS
3            anti  NNP
4  discrimination   NN
5            unit  NNP
6  implementation   NN
7      commission  NNP,             word  POS
11        forces  NNS
12  investigated  VBN
13         order   NN
14           put   VB
15    

## Part 3 - convert the samples into a Pandas DataFrame (10 points)

Here, you will take the samples and create a table whose columns are the features and the class and whose rows are the samples.  All the features and the class will be binary.  Note that there may be many columns, in the hundreds or thousands depending on the diversity of the final two letters of the non-stop-words in the dataset, but each row will have exactly four active features, so the sum of each row will be four or five depending on the class.
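To make the column layout concrete, here is a dependency-free sketch of the one-hot table. The function name `one_hot_table`, the `(position, ending)` column keys, and the `"class"` column label are illustrative assumptions; the resulting rows and column names could then be handed to a Pandas DataFrame.

```python
def one_hot_table(instances):
    """Build (columns, rows) of 0/1 values from (features, label) instances.

    Each instance is ((e1, e2, e4, e5), label), where the features are the
    two-letter endings of words 1, 2, 4 and 5 of a window.
    """
    # A column is a (position, ending) pair, so "zy" at position 4 and "zy"
    # at position 5 end up in different columns.
    positions = (1, 2, 4, 5)
    columns = sorted({(pos, end) for feats, _ in instances
                      for pos, end in zip(positions, feats)})
    index = {col: i for i, col in enumerate(columns)}
    rows = []
    for feats, label in instances:
        row = [0] * (len(columns) + 1)  # the last cell holds the class
        for pos, end in zip(positions, feats):
            row[index[(pos, end)]] = 1
        row[-1] = label
        rows.append(row)
    return columns + ["class"], rows


instances = [(("wn", "ox", "zy", "og"), 1), (("ck", "wn", "ed", "zy"), 0)]
columns, rows = one_hot_table(instances)
for row in rows:
    print(sum(row[:-1]), row[-1])  # number of active features, then the class
```

Every row activates exactly one column per position, which is where the fixed count of four active features comes from.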

In [6]:
import numpy as np
#First we transform each df into the following array:
#  first element  = last two letters of the first word
#  second element = last two letters of the second word
#  third element  = last two letters of the fourth word
#  fourth element = last two letters of the fifth word
#  fifth element  = POS of the third word
def transform(all_samples):
    trans=[]
    for df in all_samples:
        arr=np.zeros(5,dtype=object)
        arr[0]=df['word'].iloc[0][-2:]
        arr[1]=df['word'].iloc[1][-2:]
        arr[2]=df['word'].iloc[3][-2:]
        arr[3]=df['word'].iloc[4][-2:]
        arr[4]=df['POS'].iloc[2]
        trans.append(arr)
    return trans

In [8]:
transformed=transform(all_samples)

In [9]:
transformed[100:110]

[array(['rs', 'ft', 'ed', 'nt', 'NN'], dtype=object),
 array(['ed', 'ry', 'ms', 'ct', 'JJ'], dtype=object),
 array(['et', 'te', 'ps', 'as', 'JJ'], dtype=object),
 array(['io', 'ts', 'al', 'ed', 'RB'], dtype=object),
 array(['ny', 'ng', 'ed', 'ed', 'NN'], dtype=object),
 array(['ee', 'nt', 'ns', 'ry', 'NNP'], dtype=object),
 array(['ed', 'ly', 'ss', 'on', 'VBG'], dtype=object),
 array(['rs', 'ct', 'ct', 'pc', 'VBD'], dtype=object),
 array(['sm', 'nt', 'ws', 'ns', 'NNP'], dtype=object),
 array(['nd', 'so', 'id', 'ns', 'VBZ'], dtype=object)]

In [84]:
#We join them in one matrix:
matrice=np.stack(transformed)

In [85]:
#We choose a class to classify: NN (1 if the third word is tagged NN, else 0)
matrice[:,4]=1*(matrice[:,4]=='NN')

In [64]:
dg=pd.DataFrame(transformed)

In [65]:
dg.groupby(dg.iloc[:,4]).count()

Unnamed: 0_level_0,0,1,2,3,4
4,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
$,2,2,2,2,2
CC,11,11,11,11,11
CD,326,326,326,326,326
DT,46,46,46,46,46
EX,1,1,1,1,1
FW,37,37,37,37,37
IN,490,490,490,490,490
JJ,6659,6659,6659,6659,6659
JJR,104,104,104,104,104
JJS,84,84,84,84,84


In [86]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
m=enc.fit_transform(matrice[:,:4]).toarray()


In [87]:
m_final=np.concatenate((m,matrice[:,4][:,None]),axis=1)

In [88]:
m_final.sum(axis=0)

array([1.0, 16.0, 23.0, ..., 1.0, 1.0, 14152], dtype=object)

In [70]:
enc.categories_

[array(['aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'ak', 'al',
        'am', 'an', 'ao', 'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw',
        'ax', 'ay', 'az', 'b', 'ba', 'bb', 'bc', 'bd', 'be', 'bh', 'bi',
        'bj', 'bm', 'bo', 'bq', 'bs', 'bt', 'bv', 'bw', 'by', 'c', 'ca',
        'cc', 'cd', 'ce', 'cf', 'ch', 'ci', 'cj', 'ck', 'cl', 'cm', 'cn',
        'co', 'cp', 'cr', 'cs', 'ct', 'cw', 'cx', 'cy', 'da', 'db', 'dc',
        'dd', 'de', 'df', 'dg', 'di', 'dj', 'dm', 'dn', 'do', 'dp', 'dr',
        'ds', 'dt', 'du', 'dy', 'e', 'ea', 'eb', 'ec', 'ed', 'ee', 'ef',
        'eg', 'ei', 'ek', 'el', 'em', 'en', 'eo', 'ep', 'er', 'es', 'et',
        'eu', 'ev', 'ew', 'ex', 'ey', 'ez', 'f', 'fc', 'fe', 'ff', 'fn',
        'fp', 'fs', 'ft', 'fu', 'fy', 'g', 'ga', 'gb', 'gc', 'ge', 'gg',
        'gh', 'gi', 'gm', 'gn', 'go', 'gs', 'gu', 'gy', 'h', 'ha', 'hc',
        'hd', 'he', 'hg', 'hi', 'hl', 'hn', 'ho', 'hr', 'hs', 'ht', 'hw',
        'hy', 'ia', 'ib', 'ic', 'id', 'ie', 

In [89]:
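#pair each ending with its 1-based feature position; enumerate avoids list.index on numpy arrays,
#which compares arrays elementwise and triggers a warning
col=[np.array(list(zip([i+1]*len(arr),list(arr)))) for i,arr in enumerate(enc.categories_)]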


In [90]:
co=np.concatenate(col)

In [91]:
co

array([['1', 'aa'],
       ['1', 'ab'],
       ['1', 'ac'],
       ...,
       ['4', 'zi'],
       ['4', 'zn'],
       ['4', 'zo']], dtype='<U21')

In [92]:
co=np.concatenate([co,[['target','target']]])

In [75]:
len(co)

1579

In [93]:
m_final.shape


(50000, 1579)

In [13]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd

#First we transform each df into the following array:
#  first element  = last two letters of the first word
#  second element = last two letters of the second word
#  third element  = last two letters of the fourth word
#  fourth element = last two letters of the fifth word
#  fifth element  = POS of the third word
def transform(all_samples):
    trans=[]
    for df in all_samples:
        arr=np.zeros(5,dtype=object)
        arr[0]=df['word'].iloc[0][-2:]
        arr[1]=df['word'].iloc[1][-2:]
        arr[2]=df['word'].iloc[3][-2:]
        arr[3]=df['word'].iloc[4][-2:]
        arr[4]=df['POS'].iloc[2]
        trans.append(arr)
    return trans

def create_df(all_samples):

    transformed=transform(all_samples)

    #We join them in one matrix:
    matrice=np.stack(transformed)

    #We choose a class to classify: NN (1 if the third word is tagged NN, else 0)
    matrice[:,4]=1*(matrice[:,4]=='NN')

    enc = OneHotEncoder()
    m=enc.fit_transform(matrice[:,:4]).toarray()

    m_final=np.concatenate((m,matrice[:,4][:,None]),axis=1)
    #pair each ending with its 1-based feature position; enumerate avoids list.index on numpy arrays
    col=[np.array(list(zip([i+1]*len(arr),list(arr)))) for i,arr in enumerate(enc.categories_)]
    co=np.concatenate(col)
    co=np.concatenate([co,[['target','target']]])
    df=pd.DataFrame(m_final,columns=co)
    df=df.rename(columns={('target', 'target'):'target'})
    
    return df

In [95]:
df.sum(axis=0)

(1, aa)               1.0
(1, ab)              16.0
(1, ac)              23.0
(1, ad)             142.0
(1, ae)               2.0
                    ...  
(4, ze)              30.0
(4, zi)               3.0
(4, zn)               1.0
(4, zo)               1.0
(target, target)    14152
Length: 1579, dtype: object

In [98]:
df=df.rename(columns={('target', 'target'):'target'})

In [100]:
df.iloc[5,:][df.iloc[5,:]!=0]

(1, ht)    1.0
(2, od)    1.0
(3, es)    1.0
(4, ba)    1.0
Name: 5, dtype: object

In [102]:
all_samples[5]

Unnamed: 0,word,POS
2,right,NN
3,food,NN
4,examined,VBN
5,cases,NNS
6,cuba,NNP


In [64]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
m=enc.fit_transform(mat[:,:4]).toarray()
m_final=np.concatenate((m,mat[:,4][:,None]),axis=1)

In [79]:
pd.MultiIndex.from_arrays(zip([1,2,3,4],enc.categories_))

TypeError: unhashable type: 'numpy.ndarray'

In [76]:
len(enc.categories_)

4

In [73]:
import pandas as pd
df_final=pd.DataFrame(m_final,columns=enc.categories_+['target'])

ValueError: Shape of passed values is (50000, 1605), indices imply (50000, 5)

In [61]:
m

array([[0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1.,
        0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 1.],
       [0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0.,
        1., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 1., 0.],
       [0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
        0., 0., 0.]])

In [63]:
np.concatenate((m,mat[:,4][:,None]),axis=1)

array([[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0,
        0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0],
       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
        0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0],
       [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,
        1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0],
       [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0,
        0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0],
       [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0,
        0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0]], dtype=object)

In [33]:
enc.categories_

[array(['al', 'le', 'ne', 'nt', 'ws'], dtype=object),
 array(['il', 'ns', 'on', 'ts'], dtype=object),
 array(['ar', 'ce', 'de', 'es', 'nt'], dtype=object),
 array(['al', 'ar', 'eb', 'nt', 'om'], dtype=object),
 array([0], dtype=object)]

In [24]:
#make a df with one column for each unique combination of each of the 4 first columns of matrice
list1=list(set(list(matrice[:,0])))
list2=list(set(list(matrice[:,1])))
list3=list(set(list(matrice[:,2])))
list4=list(set(list(matrice[:,3])))


In [25]:
[1]*len(list1)

[1, 1, 1, 1, 1, ...


In [101]:
all_samples[5]

Unnamed: 0,word,POS
2,right,NN
3,food,NN
4,examined,VBN
5,cases,NNS
6,cuba,NNP


In [17]:
import numpy as np
a=np.zeros(5, dtype=object)

In [18]:
a[1]='re'

In [12]:
import gzip
import random

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
from sklearn.preprocessing import OneHotEncoder

fulldf = create_df(all_samples)
fulldf[25000:25010]


Unnamed: 0,"(1, ab)","(1, ac)","(1, ad)","(1, ae)","(1, af)","(1, ag)","(1, ah)","(1, ai)","(1, ak)","(1, al)",...,"(4, yi)","(4, yl)","(4, yo)","(4, yp)","(4, ys)","(4, yz)","(4, za)","(4, ze)","(4, zi)",target
25000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
25001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
25002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
25003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
25004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0
25005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
25006,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
25007,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
25008,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
25009,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0


In [8]:
fulldf = mc.create_df(all_samples)
fulldf[25000:25010]

NameError: name 'np' is not defined

## Part 4 - extract training and testing sets (3 points)

Here, you will create the training and testing datasets in order to train the model.  This will be based on a test percentage.  Round up if the percentage does not divide evenly into the sample size.  You will need to separate the class column into the y-values for the classifier.
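The rounding rule can be illustrated with a small index-based sketch. The function name `split_indices` and the `seed` parameter are assumptions for illustration; a library routine such as sklearn's `train_test_split` could be used instead.

```python
import math
import random


def split_indices(n_samples, test_percent, seed=None):
    """Return (train_idx, test_idx) with the test size rounded up."""
    n_test = math.ceil(n_samples * test_percent / 100)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle so the split is random
    return indices[n_test:], indices[:n_test]


# 20% of 7 samples is 1.4, which rounds up to 2 test samples.
train_idx, test_idx = split_indices(7, 20, seed=0)
print(len(train_idx), len(test_idx))
```

The same indices would then select rows from both the feature columns and the class column, so the X and y sets stay aligned.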

In [15]:
fulldf.iloc[:,-1]

0        0
1        1
2        0
3        0
4        0
        ..
49995    0
49996    0
49997    0
49998    0
49999    1
Name: target, Length: 50000, dtype: object

In [24]:
from sklearn.model_selection import train_test_split

def split_samples(fulldf, test_percent):
    X=fulldf.iloc[:,:-1]
    y=fulldf.iloc[:,-1]
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=test_percent/100)
    return train_X, train_y, test_X, test_y

In [25]:
train_X, train_y, test_X, test_y = split_samples(fulldf, test_percent=20)
len(train_X), len(train_y), len(test_X), len(test_y)

(40000, 40000, 10000, 10000)

In [None]:
train_X, train_y, test_X, test_y = mc.split_samples(fulldf, test_percent=20)
len(train_X), len(train_y), len(test_X), len(test_y)

## Part 5 - train models (3 points)

You will then train and return two support vector machine (SVM) models using the sklearn SVC class.  You should allow a choice between linear and radial basis function kernels.

In [36]:
train_y=train_y.astype(int)

In [37]:
from sklearn.svm import SVC

def train(train_X, train_y, kernel):
    clf=SVC(kernel=kernel)
    #the target column has dtype object, so cast it to int before fitting
    clf.fit(train_X, train_y.astype(int))
    return clf

In [None]:
model_linear = train(train_X, train_y, kernel='linear')
model_rbf = train(train_X, train_y, kernel="rbf")
model_linear, model_rbf

In [None]:
model_linear = mc.train(train_X, train_y, kernel='linear')
model_rbf = mc.train(train_X, train_y, kernel="rbf")
model_linear, model_rbf

## Part 6 - evaluate the models (5 points)

You will calculate and print precision, recall, and F-measure for the models on the test data. In `notes.md`, write down your comparison of these simple measures on the two models and any thoughts you might have on what they mean. (It could be very short, and since the samples do not stay stable between runs, you can save the evaluation scores in `notes.md` too.)
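For reference, the three measures can be computed by hand from the confusion counts of the positive class. The sketch below uses a hypothetical helper name, `precision_recall_f1`, and toy labels; it is meant to show what the numbers mean, not to replace a library implementation.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a model never predicts class 1.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

In the toy data there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.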

In [None]:
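from sklearn.metrics import precision_recall_fscore_support
def eval_model(model, test_X, test_y):
    y_pred=model.predict(test_X)
    #cast the object-dtype targets to int; average='binary' scores the positive class (1)
    prec,recall,fscore,_=precision_recall_fscore_support(test_y.astype(int),y_pred,average='binary')
    print("precision:",prec,"recall:",recall,"F-measure:",fscore)
    return prec,recall,fscore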

In [None]:
eval_model(model_linear, test_X, test_y)
eval_model(model_rbf, test_X, test_y)

In [None]:
mc.eval_model(model_linear, test_X, test_y)
mc.eval_model(model_rbf, test_X, test_y)

## Part Bonus - try another sort of model from sklearn (5 points)

Write a separate command-line script (not a notebook) that uses `mycode.py` to do all of the above, except that it trains a non-SVM classifier model.  Any non-trivial model available in sklearn will do. Explain how to run your code and the results of your own evaluation in `notes.md`, including any observations or opinions you may have on the classifier method you used in comparison to SVM.
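A possible skeleton for such a script is sketched below. The option names (`--lines`, `--samples`, `--test-percent`) and their defaults are assumptions chosen to mirror the values used in this notebook; only the `mc.*` function names come from the assignment itself.

```python
import argparse


def build_parser():
    """Command-line options for a hypothetical bonus script."""
    parser = argparse.ArgumentParser(
        description="Train and evaluate a non-SVM classifier on word-ending features.")
    parser.add_argument("corpus", help="path to the gzipped corpus file")
    parser.add_argument("--lines", type=int, default=100000,
                        help="number of lines to sample from the corpus")
    parser.add_argument("--samples", type=int, default=50000,
                        help="number of five-word samples to create")
    parser.add_argument("--test-percent", type=float, default=20,
                        help="percentage of samples held out for testing")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    import mycode as mc  # imported here so the parser works without the module

    lines = mc.sample_lines(args.corpus, lines=args.lines)
    sentences = mc.process_sentences(lines)
    samples = mc.create_samples(sentences, samples=args.samples)
    fulldf = mc.create_df(samples)
    train_X, train_y, test_X, test_y = mc.split_samples(
        fulldf, test_percent=args.test_percent)
    # Train and evaluate the non-SVM model of your choice here.
    return train_X, train_y, test_X, test_y

# In the actual script, finish with:  if __name__ == "__main__": main()
```

Keeping the argument parsing in its own function makes it easy to check the options without running the whole pipeline.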

## Submission

Push to your fork of the GitHub repository (which must be made public) and submit the URL of your repository in Canvas.  You can submit this notebook with the output from your run, as long as you do not modify the code or text in it without permission from us.  

In [None]:
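from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import GaussianNB
def eval_Naive_Bayes(train_X, train_y, test_X, test_y):
    gnb = GaussianNB()
    #cast the object-dtype targets to int so that sklearn treats them as binary labels
    train_y, test_y = train_y.astype(int), test_y.astype(int)
    y_pred = gnb.fit(train_X, train_y).predict(test_X)
    prec,recall,fscore,_=precision_recall_fscore_support(test_y,y_pred,average='binary')
    return prec,recall,fscore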