#### Assignment 1 (NLU 22)
##### Sentiment analysis using logistic regression

In this assignment you will train a binary classifier for movie reviews from the IMDB database. The task is to classify each review as positive or negative. The task has been divided into several subtasks.

1. You will need the following libraries. Do not import everything.  
    * re
    * pandas
    * numpy
    * scipy
    * nltk
    * scikit-learn
2. The data consists of two directories--positive and negative reviews.   
3. Each review is a document. 
4. Process the texts so that you get rid of punctuation but keep the spaces. We have to be careful with stopwords: completely removing them may lead to loss of crucial information. (How?) 
5. You will have to map each document (review) to a vector. 
6. You will need to use **tf/idf** weighting. You should create the tf-idf vectors from scratch. **Do not use library functions**. 
7. Once you have the vectors for each document apply logistic regression to the training set to fix the weights. You *may* use **sklearn** logistic regression function from linear models.  
8. Test your model with the test set and report accuracy, recall and precision. 
9. Write a short report (about 250 words) on the model and how one may improve it. 
10. You will be provided with a set of functions. Your task is to complete them. 
11. Do not change anything in the structure of these functions. If you have print functions for testing comment them out. 
12. Commenting your code is important, but do not over-comment. 
13. The following are the basic steps:
    1. process the dataset.
    2. build a vocabulary
    3. convert documents to vectors by **tf/idf** weighting
    4. match the input/output vectors
    5. train the model (use logistic regression from sk-learn linear models)
    6. test the model and compute performance measures
    7. write a short report (perhaps start with the report)

In [8]:
import os, re
import numpy as np
import pandas as pd
import time
import random
from pathlib import Path
from collections import defaultdict
# these are the libraries you might need. numpy, random will be needed
# defaultdict will also prove useful

In [None]:
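# check out the following to help you with files
r_dir = os.walk("imdb_dataset").__next__()[0]
# os.walk does not guarantee the order of subdirectories; sorting puts 'neg2' before 'pos2'
d_dir = sorted(os.walk("imdb_dataset").__next__()[1])
data_dirs = [os.path.join(r_dir, d1) for d1 in d_dir]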
data_dirs

##### Hints    

1. The list data_dirs contains the paths of the directories inside *imdb_dataset*.  
2. For example, on Windows the list should look like *['imdb_dataset\\neg2', 'imdb_dataset\\pos2']*. 
3. Store the imdb_dataset directory wherever you are working. **Do not change the names**. It is best to create a separate directory for the assignment and keep everything there. 
4. Do not use absolute paths. If, for example, there is a file "100.txt" in the positive directory, the path is "imdb_dataset\\pos2\\100.txt". This is the path you should use. 
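The relative-path advice above can be sketched as follows. The file name "100.txt" is the one used in the hint; building the path with `os.path.join` is one possible way to avoid hard-coding the separator, not a requirement of the assignment.

```python
import os

# Build a relative path piece by piece; os.path.join inserts the
# separator that matches the operating system.
review_path = os.path.join("imdb_dataset", "pos2", "100.txt")

# The path is relative, so it resolves against the notebook's working directory.
print(os.path.isabs(review_path))   # False
print(review_path.split(os.sep))    # ['imdb_dataset', 'pos2', '100.txt']
```

Because the path is relative, the notebook keeps working when the whole assignment directory is moved to another machine.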

1. The function below splits the two directories (*neg* and *pos*) in the ratio 3:1 (approximately 75% to 25%) and combines them into 2 lists, say *train* and *test*.
2. Both the train and test lists should contain positive and negative reviews in approximately equal numbers. Suppose there are 100 files in "pos" and 105 files in "neg". Your "train" list should contain around 150 files, about 75 of which are names of positive files. 
3. You should not create lists containing the texts, only the names of the files. You open them when needed. 
4. Please follow the directory structure given here. 
    1. The notebook file and the top directory for the data **imdb_rev** should be in the same directory. 
    2. **imdb_rev** contains 2 directories, *neg* and *pos*, containing negative and positive reviews respectively. 
    3. So the file path is "imdb_rev/neg/filename" or "imdb_rev/pos/filename". 
5. You may try the random.sample method from the random library. 
6. You should mix up the lists a bit, perhaps using random.shuffle. 
7. Since the train/test lists contain both types of files you should also keep the "sentiment" information. One possibility is to store them as tuples. Suppose "00.txt" is picked up from the "neg" directory for the train list. We keep it as ("00.txt", 0). This will also help you in building the output (**y**-vector). 
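The splitting idea can be tried on toy file names before touching the real directories. This is only a sketch: the helper `split_names`, its parameters and the toy names are made up for illustration and are not part of the provided skeleton.

```python
import random

def split_names(names, label, train_frac=0.75, seed=0):
    """Split one directory's file names into (name, label) train/test lists."""
    rng = random.Random(seed)                      # local, reproducible generator
    n_train = int(round(train_frac * len(names)))
    chosen = set(rng.sample(names, n_train))       # names that go to the train list
    train = [(n, label) for n in names if n in chosen]
    test = [(n, label) for n in names if n not in chosen]
    return train, test

# Toy file names standing in for the contents of the two directories.
neg_files = [f"n{i}.txt" for i in range(8)]
pos_files = [f"p{i}.txt" for i in range(8)]

neg_tr, neg_ts = split_names(neg_files, 0)
pos_tr, pos_ts = split_names(pos_files, 1)

train = neg_tr + pos_tr
test = neg_ts + pos_ts
random.Random(1).shuffle(train)   # mix positive and negative entries
random.Random(1).shuffle(test)

print(len(train), len(test))      # 12 4
```

Splitting each directory separately and then combining keeps the two classes balanced in both lists.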

In [1]:
import os
def train_test_split():
    # the following lists contain the file names in the 'neg' and 'pos' directories respectively
    neg_dir = os.walk(data_dirs[0]).__next__()[2]
    pos_dir = os.walk(data_dirs[1]).__next__()[2]
    tr, ts = [], []  # train and test lists of (filepath, sentiment) tuples
    # your code goes here: fill tr and ts
    return tr, ts

##### Hints
1. Explore the function os.walk(*pathname*). 
2. Try os.walk(*pathname*).__next__(). It gives a tuple of three members. The first member is the name of the current directory, the second is the list of its subdirectories, and the **3rd** member is the list of files in the current directory. Explore. 
3. Your train-test split should contain the pathnames to the files, not the files themselves. 
4. If the instructions are followed, the list variable "neg_dir" above contains the list of file names in the negative directory "neg2". 
5. The train and test lists must contain about equal numbers of file names from the positive and negative directories. 
6. So the file path is not enough. Each entry must also store the "sentiment", that is, whether the file came from the negative or positive directory. One simple way to do it is to store a tuple: *(filepath, sentiment)*. You may call the sentiments 0 and 1. For example, if the file "007.txt" came from the negative directory, the list stores ("imdb_dataset/neg2/007.txt", 0) on Linux or macOS. You do not have to worry about the operating system if you build paths with os.path.join; it picks the right separator automatically. 
7. Remember to shuffle the list. You may try random.sample(...) (for sampling) and random.shuffle(...). 
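To see what os.walk returns without depending on the real dataset, one can build a tiny throwaway directory tree. The directory and file names below mimic the ones in the hints; the temporary directory itself is only an illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create root/neg2 and root/pos2 with one file each.
    for sub, name in [("neg2", "007.txt"), ("pos2", "100.txt")]:
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, name), "w", encoding="utf-8") as fh:
            fh.write("toy review")

    # First item produced by os.walk: (dirpath, dirnames, filenames) for root.
    dirpath, dirnames, filenames = next(os.walk(root))
    subdirs = sorted(dirnames)      # os.walk does not promise any order

    # Walking a subdirectory gives its file names as the third member.
    neg_files = next(os.walk(os.path.join(root, subdirs[0])))[2]

    # Pair every path with its sentiment (0 for the negative directory).
    labelled = [(os.path.join(dirpath, subdirs[0], f), 0) for f in neg_files]

print(subdirs)      # ['neg2', 'pos2']
print(neg_files)    # ['007.txt']
```

Note that `next(os.walk(path))` is equivalent to `os.walk(path).__next__()`.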

In [None]:
from nltk.corpus import stopwords
stopWords = sorted(list(stopwords.words('english')))
# this gives you all the stopwords including negations like "no", "not", "shouldn't"
# you must decide how to deal with them, they could be important for the sentiment
# (if the stopword corpus is missing, run nltk.download('stopwords') once)

*One simple way is to modify the list* stopWords: *remove words like "not", "no", "wouldn't", "couldn't" etc.* and *create a new list of stopwords*. Use this modified list to get rid of stopwords. 
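A minimal sketch of this filtering, using a short hand-written list in place of the full NLTK list so that it runs without any download. Which words count as "negations" is an assumption here; you should decide on your own set.

```python
# A small stand-in for stopwords.words('english'); the real list is much longer.
demo_stopwords = ["a", "the", "is", "no", "not", "wouldn't", "couldn't", "and"]

# Assumed set of negation words to keep in the text; extend it as you see fit.
negations = {"no", "not", "nor"}

# Drop negations and every contraction ending in "n't" from the stopword list.
stopWords2 = [w for w in demo_stopwords
              if w not in negations and not w.endswith("n't")]

print(stopWords2)   # ['a', 'the', 'is', 'and']
```

Words removed from the stopword list survive tokenization, so "not good" and "good" end up with different feature vectors.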

1. The following function builds the vocabulary for the training set. 
2. Vocabulary is the **set** of tokens in all the texts in the "train" list. 
3. Here you have to deal with stopwords. 
4. It may prove useful to have an auxiliary function, say *proc_text(txt)*. It takes a text string as input and produces a *list* of tokens, getting rid of the words in the modified stopwords list. See below for a possible way of defining such a function. 
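The tokenizing step can first be tried on a single string. In this sketch the names `tokenize_demo` and `demo_stopwords` are illustrative, and lower-casing the text is an assumption that the assignment does not prescribe.

```python
import re

# '' is included because re.split can produce empty strings at the ends.
demo_stopwords = {"the", "was", "a", ""}

def tokenize_demo(txt):
    # Lower-case, then split on every run of characters that is not a letter,
    # digit or apostrophe; punctuation and whitespace disappear together.
    tokens = re.split(r"[^a-z0-9']+", txt.lower())
    return [t for t in tokens if t not in demo_stopwords]

print(tokenize_demo("The plot was NOT good, not good at all!"))
# ['plot', 'not', 'good', 'not', 'good', 'at', 'all']
```

The result is a list, so repeated tokens are kept; this matters later when counting term frequencies.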

In [None]:
# "stopWords2" is the new list of stopwords which does not have the "negation" words. 
# this is an auxiliary function
def proc_text(txt):
    # split the text into tokens, getting rid of all punctuation, use re.split()
    # add the empty string '' to stopWords2 since splitting may produce the empty string
    # list of tokens in "txt"
    tok_list = []
    # your code goes here. Store the tokens in "txt" that are not in stopWords2 in tok_list.
    return tok_list
# remember the function returns a list so there may be duplicates.

In [None]:
def build_vocab(tr):
    # vocab = set()
    # you may need to use the function proc_text
    # vocab_dict stores the document frequency; remember a token may occur multiple times
    # in a document but it is counted as 1.
    # update vocab_dict (tokens as keys, document counts as values) --> look at the hints
    vocab_dict = defaultdict(int)
    # your code goes here
    return vocab_dict

##### Hints
1. Suppose you have made the train and test list of *filepaths*. 
2. Use the train list to build the vocabulary. If you have stored the *filepath* and sentiment as a pair (as suggested above) then you need the first entry of the pair.  
3. Open the files one-by-one and process them. **Close** a file before opening the next. The "with open(...)" statement may be useful because it closes the file for you. Remember to include the encoding. 
4. Get rid of the stopwords, except the negation types (see above). You may use the auxiliary function. 
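Counting document frequency can be sketched on token lists held in memory. The three toy documents below stand in for the output of *proc_text* on three files; reading the files is left out.

```python
from collections import defaultdict

# Token lists for three toy documents.
docs = [
    ["good", "good", "plot"],
    ["not", "good"],
    ["bad", "plot", "bad"],
]

doc_freq = defaultdict(int)
for tokens in docs:
    for tok in set(tokens):      # set(): each token counts once per document
        doc_freq[tok] += 1

print(dict(sorted(doc_freq.items())))
# {'bad': 1, 'good': 2, 'not': 1, 'plot': 2}
```

"good" occurs three times in total but in only two documents, so its document frequency is 2.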

1. The following function is perhaps the most important for this assignment. Now that we have the vocabulary we create the tf-idf vector for each document. 
2. This is the map taking a document to a vector. 
3. The size of the vector will equal the length of the vocabulary. 
4. Put the vector as a *row vector* in the input matrix. 
5. Suppose the training set has $m$ documents and the vocabulary has $n$ tokens. The input matrix is of the order $m\times n$. 
6. Remember each row represents a document. 
7. You should simultaneously create the output vector $y$ representing the sentiment of the corresponding document. It is a column vector with dimension $m$. For example, a "train-set" with 4 files and a vocabulary consisting of 5 tokens will have the form:

$$
X = 
\begin{pmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} \\
x_{31} & x_{32} & x_{33} & x_{34} & x_{35} \\
x_{41} & x_{42} & x_{43} & x_{44} & x_{45} 
\end{pmatrix}
\longrightarrow
y = \begin{pmatrix}
0\\
1\\
0\\
1
\end{pmatrix} 
$$
8. Here the first row represents the tf-idf vector for document 1 which is negative (0), the second row for document 2 (positive) etc.  

9. **Do not use any readymade library functions for tf-idf**. 
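The assignment does not fix a particular tf-idf formula. One common variant, given here as an assumption rather than the required definition, computes the entry for document $d_i$ and token $t_j$ as

$$
\text{tf}(t_j, d_i) = \frac{\text{count of } t_j \text{ in } d_i}{\text{number of tokens in } d_i}, \qquad
\text{idf}(t_j) = \log\frac{m}{\text{df}(t_j)}, \qquad
x_{ij} = \text{tf}(t_j, d_i)\cdot \text{idf}(t_j)
$$

where $m$ is the number of training documents and $\text{df}(t_j)$ is the number of training documents containing $t_j$. For instance, with $m = 4$, a token that occurs 2 times in a 10-token document and appears in 2 documents gets $0.2 \cdot \log 2 \approx 0.139$ (natural logarithm). Other variants (raw counts, smoothed idf) are equally acceptable as long as you state which one you use.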

In [7]:
def tf_idf_matrix(train, vocab_dict):
    nrow = len(train)
    ncol = len(vocab_dict)
    tok_dict = defaultdict(int)
    # update tok_dict: map each token in vocab_dict to its column index in X
    X = np.zeros((nrow, ncol), dtype=np.float32)
    y = np.zeros((nrow, 1), dtype=np.float32)
    for indx in range(nrow):
        tok_list = []
        # your code: update tok_list, you may use the "proc_text" function here
        for token in tok_list:
            # .get() avoids adding unseen tokens to the defaultdict
            if vocab_dict.get(token, 0) == 0:
                # this is for the test data: if the token is not in the vocab, ignore it
                continue
            pass
        # compute both the tf-value and idf-value and update X, y
    return X, y

##### Hints
1. The initial part of the function is similar to building the vocabulary. The document frequency is already there in the output of the *build_vocab* function. Call that function separately and store its output. 
2. First fill "tok_dict" using the dictionary *vocab_dict* that is passed to the function: assign each token a column index. The index will tell you where to store the tf-idf values in the matrix X. 
3. Recall that the entries in the "train" list are pairs: the first is the *filepath* and the second is the sentiment. 
4. It is best to use numbers, remembering that the rows represent the document index in the train list. For example, if train[0] = (*filepath*, 0) then the tf-idf values of the document in *filepath* will be stored in row 0 of X and y[0] = 0. 
5. You have to tokenize and process the documents **one-by-one** as in the case of building the vocabulary. 
    1. Open a file.
    2. Tokenize it, getting rid of the words in stopWords2. You should have a list of tokens, say "tok_list", in the document. 
    3. Count the occurrences of each token. You may use the built-in "Counter" from the Python 3 *collections* module. 
    4. Calculate the tf-value and store it in X at (*indx*, *tok_dict[token]*). Recall that *tok_dict* stores the index of the tokens. 
    5. Compute the idf-value. 
    6. Update the appropriate entry of X. 
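The steps above can be sketched on a toy corpus held in memory. This is not the required implementation: the tf and idf formulas used (relative frequency and plain $\log(m/\text{df})$) are assumptions, and the names `train_docs`, `doc_freq` and `tok_index` are made up for the example.

```python
import math
from collections import Counter
import numpy as np

# Toy training data: (token list, sentiment). In the assignment the tokens come from files.
train_docs = [(["good", "good", "plot"], 1),
              (["not", "good"], 0),
              (["bad", "plot", "bad"], 0)]

# Document frequency and a fixed column index for every vocabulary token.
doc_freq = Counter(tok for tokens, _ in train_docs for tok in set(tokens))
tok_index = {tok: j for j, tok in enumerate(sorted(doc_freq))}

m, n = len(train_docs), len(tok_index)
X = np.zeros((m, n), dtype=np.float32)
y = np.zeros(m, dtype=np.float32)

for i, (tokens, label) in enumerate(train_docs):
    counts = Counter(tokens)
    for tok, c in counts.items():
        if tok not in tok_index:             # unseen token (relevant for test data)
            continue
        tf = c / len(tokens)                 # assumed: relative frequency
        idf = math.log(m / doc_freq[tok])    # assumed: plain log(m / df)
        X[i, tok_index[tok]] = tf * idf
    y[i] = label

print(X.shape, y.tolist())   # (3, 4) [1.0, 0.0, 0.0]
```

For test documents the same `doc_freq`, `tok_index` and `m` from the training set would be reused, so that train and test vectors live in the same space.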


1. There are 2 more functions for training and testing using **X** and **y** above.
2. These will mostly involve library functions. 
3. You may use any intermediate "helper" functions if necessary.
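Fitting the classifier with scikit-learn can be sketched on a tiny made-up feature matrix. The data and the variable names are illustrative only, and the default settings of `LogisticRegression` are used; they may need tuning on the real tf-idf matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix: 6 documents, 2 features; label 1 when feature 0 dominates.
X_toy = np.array([[0.9, 0.1], [0.8, 0.0], [0.7, 0.2],
                  [0.1, 0.9], [0.0, 0.8], [0.2, 0.7]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression()   # default settings
clf.fit(X_toy, y_toy)        # y should be 1-D; use y.ravel() for an (m, 1) array

pred = clf.predict(X_toy)    # predicted class for each row
print(pred)
```

The fitted object holds the learned weights in `clf.coef_` and `clf.intercept_`, and it is this object that is later used for prediction on the test set.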

In [None]:
# for training data
def apply_logit(X, yt):
    from sklearn.linear_model import LogisticRegression
    es = None  # placeholder for the fitted model
    # your code goes here, "es" is the fitted sklearn LogisticRegression object
    return es

1. Test the model with the test data X, y. 
2. You have to create the tf-idf features for the documents in the test files. Be careful while creating the feature vectors for test files. There may be some tokens in these files which are not in the vocabulary. 
3. You should use the output of the previous function.
4. Hint: look at the "predict" function in LogisticRegression.
5. Now compare the class prediction of the model with the corresponding value given by y. 
6. Compute the performance metrics *accuracy*, *precision*, *recall* and *F1-score* and report them. 
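Comparing predictions with the true labels amounts to counting four kinds of outcomes. A small sketch with made-up label lists (in the assignment `y_pred` would come from the model's predict function):

```python
# Toy labels: y_true from the test list, y_pred as a classifier might return them.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives

print(tp, tn, fp, fn)   # 3 3 1 1
```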


In [None]:
param = apply_logit(X, yt)

In [None]:

def test_model(X, y):
    # use the object "param" above, this is the output after training
    # you should call the function tf_idf_matrix with the test list
    acc = 0
    prec = 0
    recall = 0
    f1 = 0
    # your code goes here
    return acc, prec, recall, f1
    

For your convenience, the definitions of the metrics are given below. 

#### Metrics 
1. $\text{precision} = \frac{tp}{tp + fp} $
2. $\text{recall} = \frac{tp}{tp + fn} $
3. $\text{accuracy} = \frac{tp+tn}{tp +tn+ fp+fn} $
4. $$ F_\beta =\frac{(1 + \beta^2)\cdot tp}{(1 + \beta^2)\cdot tp + \beta^2\cdot fn + fp}$$
5. $$F_1 = \frac{tp}{tp + (fp+fn)/2}$$

1. $tp = \text{number of "true positive" } $ 
2. $tn = \text{number of "true negative" }$
3. $fp = \text{number of "false positive" }$ 
4. $fn = \text{number of "false negative" }$
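The formulas above translate directly into code. In this sketch the function name `metrics_from_counts` is made up, and returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions prescribe.

```python
def metrics_from_counts(tp, tn, fp, fn, beta=1.0):
    """Return accuracy, precision, recall and F_beta from confusion counts."""
    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    f_beta = (1 + b2) * tp / denom if denom else 0.0
    return accuracy, precision, recall, f_beta

print(metrics_from_counts(tp=3, tn=3, fp=1, fn=1))
# (0.75, 0.75, 0.75, 0.75)
```

With `beta=1.0` the last value is the $F_1$ score, since $\frac{2\,tp}{2\,tp + fn + fp} = \frac{tp}{tp + (fp+fn)/2}$.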