#### Assignment 1 (NLU 22)
##### Sentiment analysis using logistic regression

In this assignment you will train a binary classifier on movie reviews from the IMDB database. The task is to classify each review as positive or negative. The work is divided into several subtasks. 

1. You will need the following libraries. Do not import everything.  
    * re
    * pandas
    * numpy
    * scipy
    * nltk
    * scikit-learn
2. The data consists of two directories, one with positive and one with negative reviews.   
3. Each review is a document. 
4. Process the texts so that you get rid of punctuation but keep the spaces. Be careful with stopwords: removing them completely may lead to loss of crucial information. (How?) 
5. You will have to map each document (review) to a vector. 
6. You will need to use **tf-idf** weighting. You should create the tf-idf vectors from scratch. **Do not use library functions**. 
7. Once you have the vectors for each document, apply logistic regression to the training set to fit the weights. You *may* use the **sklearn** logistic regression function from linear models.  
8. Test your model with the test set and report accuracy, recall and precision. 
9. Write a short report (about 250 words) on the model and how one may improve it. 
10. You will be provided with a set of functions. Your task is to complete them. 
11. Do not change anything in the structure of these functions. If you added print statements for testing, comment them out. 
12. Comment your code, but do not overdo it. 
13. The following are the basic steps:
    1. process the dataset
    2. build a vocabulary
    3. convert documents to vectors by **tf-idf** weighting
    4. match the input/output vectors
    5. train the model (use logistic regression from sklearn linear models)
    6. test the model and compute performance measures
    7. write a short report (you might even start with the report)
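
As a hint for the text-processing step, here is a minimal sketch of one way to remove punctuation while keeping spaces. The name `clean_text`, the lowercasing, and the restriction to ASCII letters and digits are assumptions made for illustration; they are not part of the provided functions.

```python
import re

# Hypothetical helper (not one of the provided functions).
def clean_text(text):
    text = text.lower()
    # Every character that is not a-z, 0-9 or whitespace becomes a space,
    # so words separated only by punctuation do not get glued together.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

tokens = clean_text("Not good at all... I wouldn't watch it again!")
print(tokens)
```

Note that this sketch splits "wouldn't" into "wouldn" and "t". Whether that is acceptable depends on how you decide to treat negations.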

In [None]:
import os, re
import numpy as np
import pandas as pd
import time
import random
from pathlib import Path
from collections import defaultdict
# these are the libraries you might need. numpy, random will be needed
# defaultdict will also prove useful

In [None]:
# The following may help you locate the data files.
r_dir, d_dir, _ = next(os.walk("imdb_dataset"))
# sort so that "neg" reliably comes before "pos" (os.walk order is not guaranteed)
data_dirs = [os.path.join(r_dir, d1) for d1 in sorted(d_dir)]
data_dirs

1. The function below splits the two directories (*neg* and *pos*) in a ratio of approximately 3:1 (75% to 25%) and combines them into 2 lists, say *train* and *test*.
2. Both the train and test lists should contain positive and negative reviews in approximately equal numbers. Suppose there are 100 files in "pos" and 105 files in "neg". Your "train" list should then contain around 150 file names, about 75 of which are names of positive files. 
3. You should not create lists containing the texts, only the names of the files. You read the files when needed. 
4. Please follow the directory structure given here. 
    1. The notebook file and the top directory for the data **imdb_rev** should be in the same directory. 
    2. **imdb_rev** contains 2 directories, *neg* and *pos*, containing negative and positive reviews respectively. 
    3. So the file path is "imdb_rev/neg/filename" or "imdb_rev/pos/filename". 
5. You may try the random.sample method from the standard library. 
6. You should mix up the lists a bit, perhaps using random.shuffle. 
7. Since the train/test lists contain both types of files, you should also keep the "sentiment" information. One possibility is to store them as tuples. Suppose "00.txt" is picked from the "neg" directory for the train list. We keep it as ("00.txt", 0). This will also help you in building the output (**y**-vector). 
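
The splitting idea can be sketched as follows. This is only one possible approach, not the required solution: the helper `split_labelled`, its `seed` parameter, and the made-up file names are assumptions for illustration. It splits each class separately so that both lists stay roughly balanced, and stores every file name together with its label.

```python
import random

# Hypothetical helper: split one class's file names 75/25 and attach the label.
def split_labelled(names, label, train_frac=0.75, seed=0):
    rng = random.Random(seed)
    names = sorted(names)                      # fixed order, so the seed is reproducible
    k = round(train_frac * len(names))
    train_names = set(rng.sample(names, k))    # k names drawn without replacement
    train = [(n, label) for n in names if n in train_names]
    test = [(n, label) for n in names if n not in train_names]
    return train, test

neg_files = [f"n{i}.txt" for i in range(105)]  # made-up names standing in for "neg"
pos_files = [f"p{i}.txt" for i in range(100)]  # made-up names standing in for "pos"

neg_tr, neg_ts = split_labelled(neg_files, 0)
pos_tr, pos_ts = split_labelled(pos_files, 1)
tr, ts = neg_tr + pos_tr, neg_ts + pos_ts
random.Random(0).shuffle(tr)                   # mix positive and negative entries
random.Random(0).shuffle(ts)
print(len(tr), len(ts))
```

Splitting per class, rather than sampling from the combined list, is what keeps the share of positive files close to one half in both lists.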

In [None]:
def train_test_split():
    # the following lists contain the file names in the 'neg' and 'pos' directories respectively
    neg_dir = next(os.walk(data_dirs[0]))[2]
    pos_dir = next(os.walk(data_dirs[1]))[2]
    # your code: build the lists tr and ts
    return tr, ts

In [None]:
from nltk.corpus import stopwords
stopWords = sorted(stopwords.words('english'))
# this gives you all the stopwords, including negations like "no", "not", "shouldn't"
# you must decide how to deal with them; they could be important for the sentiment

1. The following function builds the vocabulary for the training set. 
2. Vocabulary is the **set** of tokens in all the texts in the "train" list. 
3. Here you have to deal with stopwords. 
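
One way to handle stopwords is to remove most of them but keep the negations. The sketch below illustrates this under stated assumptions: `stop_words` is a tiny stand-in for the NLTK list (so the example runs without any download), the set `negations` is an arbitrary choice, and `vocab_from_token_lists` is a hypothetical helper that works on already tokenized documents rather than on file names.

```python
# Tiny stand-in for the NLTK stopword list, for illustration only.
stop_words = {"the", "a", "is", "it", "was", "no", "not", "nor"}
negations = {"no", "not", "nor"}           # assumption: words worth keeping
filtered_stop = stop_words - negations     # stopwords that will be dropped

# Hypothetical helper: collect the set of tokens over all documents.
def vocab_from_token_lists(token_lists):
    vocab = set()
    for tokens in token_lists:
        vocab.update(t for t in tokens if t not in filtered_stop)
    return vocab

docs = [["the", "film", "was", "not", "good"], ["a", "good", "film"]]
vocab = vocab_from_token_lists(docs)
# A fixed token -> column index map is handy when building the vectors later.
word_to_col = {w: j for j, w in enumerate(sorted(vocab))}
print(word_to_col)
```

Without the exception for negations, the first document would reduce to "film good", which reads as the opposite sentiment.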

In [None]:
def build_vocab(tr):
    vocab = set()
    # your code
    return vocab

1. The following function is perhaps the most important for this assignment. Now that we have the vocabulary we create the tf-idf vector for each document. 
2. This is the map taking a document to a vector. 
3. The size of the vector will equal the size of the vocabulary. 
4. Put the vector as a *row vector* in the input matrix. 
5. Suppose the training set has $m$ documents and the vocabulary has $n$ tokens. The input matrix is then of order $m\times n$. 
6. Remember each row represents a document. 
7. You should simultaneously create the output vector $y$ representing the sentiment of the corresponding document. It is a column vector with dimension $m$. For example, a "train-set" with 4 files and a vocabulary consisting of 5 tokens will have the form:

$$
X = 
\begin{pmatrix}
x_{11} & x_{12} & x_{13} & x_{14} & x_{15} \\
x_{21} & x_{22} & x_{23} & x_{24} & x_{25} \\
x_{31} & x_{32} & x_{33} & x_{34} & x_{35} \\
x_{41} & x_{42} & x_{43} & x_{44} & x_{45} 
\end{pmatrix}
\longrightarrow
y = \begin{pmatrix}
0\\
1\\
0\\
1
\end{pmatrix} 
$$
8. Here the first row is the tf-idf vector for document 1, which is negative (0), the second row is the vector for document 2 (positive), etc. 
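
The assignment does not fix a particular tf-idf formula, so the sketch below assumes one common variant: $\text{tf}(t,d)$ is the count of token $t$ in document $d$ divided by the number of in-vocabulary tokens in $d$, and $\text{idf}(t) = \log(m/\text{df}(t))$, where $\text{df}(t)$ is the number of documents containing $t$. The function name `tf_idf_from_tokens` and the `word_to_col` index map are hypothetical; check which variant your course expects before relying on this.

```python
import numpy as np

# Hypothetical helper: rows are documents, columns are vocabulary tokens.
def tf_idf_from_tokens(token_lists, word_to_col):
    m, n = len(token_lists), len(word_to_col)
    counts = np.zeros((m, n), dtype=np.float32)
    for i, tokens in enumerate(token_lists):
        for t in tokens:
            j = word_to_col.get(t)       # tokens outside the vocabulary are skipped
            if j is not None:
                counts[i, j] += 1
    # Row sums = number of in-vocabulary tokens per document (at least 1 to avoid 0/0).
    doc_len = np.maximum(counts.sum(axis=1, keepdims=True), 1)
    tf = counts / doc_len
    df = (counts > 0).sum(axis=0)        # number of documents containing each token
    idf = np.log(m / np.maximum(df, 1))  # assumed variant: log(m / df), no smoothing
    return tf * idf

docs = [["good", "film"], ["bad", "film"]]
word_to_col = {"bad": 0, "film": 1, "good": 2}
X = tf_idf_from_tokens(docs, word_to_col)
print(X)
```

In this toy example "film" occurs in every document, so its idf is $\log(2/2) = 0$ and its column is zero: a token that appears everywhere does not help separate the classes.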

In [None]:
def tf_idf_matrix(train, vocab):
    nrow = len(train)
    ncol = len(vocab)
    X = np.zeros((nrow, ncol), dtype=np.float32)
    y = np.zeros((nrow, 1), dtype=np.float32)
    # your code
    return X, y

1. There will be 2 or 3 more functions for training and testing using **X** and **y** above.
2. These will mostly involve library functions. 
3. You may use intermediate "helper" functions if necessary, but they must be called only from inside one of the above functions. 
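
For the testing step, accuracy, precision and recall can all be derived from the four confusion counts. The sketch below computes them in plain Python so that the definitions are visible; `binary_metrics` is a hypothetical helper and the label lists are made up. In practice `sklearn.metrics` offers `accuracy_score`, `precision_score` and `recall_score`. Also note that `LogisticRegression.fit` expects a one-dimensional label array, so a column vector such as **y** above would typically be passed as `y.ravel()`.

```python
# Hypothetical helper: metrics for binary labels, with 1 as the positive class.
def binary_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0      # 0.0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0         # 0.0 when there are no positives
    return accuracy, precision, recall

y_true = [1, 0, 1, 1, 0, 0]   # made-up gold labels
y_pred = [1, 0, 0, 1, 1, 0]   # made-up predictions
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```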