# Feature Engineering Homework 
***
**Name**: $<$Rajarshi Basak$>$ 

**Kaggle Username**: $<$insert username here$>$
***

This assignment is due on Moodle by **5pm on Friday February 23rd**. Additionally, you must make at least one submission to the **Kaggle** competition before it closes at **4:59pm on Friday February 23rd**. Submit only this Jupyter notebook to Moodle. Do not compress it using tar, rar, zip, etc. Your solutions to analysis questions should be done in Markdown directly below the associated question.  Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**.  For a refresher on the course **Collaboration Policy** click [here](https://github.com/chrisketelsen/CSCI5622-Machine-Learning/blob/master/resources/syllabus.md#collaboration-policy).



## Overview 
***

When people discuss popular media, they often worry about spoilers: critical information about the plot of a TV show, book, or movie that “ruins” the experience for people who haven’t read or seen it yet.

The goal of this assignment is to do text classification on forum posts from the website [tvtropes.org](http://tvtropes.org/), to predict whether a post is a spoiler or not. We'll be using the logistic regression classifier provided by sklearn.

Unlike previous assignments, the code provided with this assignment has all of the functionality required. Your job is to make the functionality better by improving the features the code uses for text classification.

**NOTE**: Because the goal of this assignment is feature engineering, not classification algorithms, you may not change the underlying algorithm or its parameters.

This assignment is structured in a way that approximates how classification works in the real world: Features are typically underspecified (or not specified at all). You, the data digger, have to articulate the features you need. You then compete against others to provide useful predictions.

It may seem straightforward, but do not start this at the last minute. There are often many things that go wrong in testing out features, and you'll want to make sure your features work well once you've found them.


## Kaggle In-Class Competition 
***

In addition to turning in this notebook on Moodle, you'll also need to submit your predictions on Kaggle, an online tournament site for machine learning competitions. The competition page can be found here:  

[https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018](https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018)

Additionally, a private invite link for the competition has been posted to Piazza. 

The starter code below has a `model_predict` method which produces a two column CSV file that is correctly formatted for Kaggle (predictions.csv). It should have the example Id as the first column and the prediction (`True` or `False`) as the second column. If you change this format your submissions will be scored as zero accuracy on Kaggle. 
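
As a rough illustration of the expected layout, the sketch below writes a two-column CSV using only Python's standard `csv` module. The column names `Id` and `spoiler` match what the starter code's `model_predict` writes; the helper name `format_submission` and the sample predictions are invented for this example.

```python
import csv
import io

def format_submission(predictions):
    """Return Kaggle-style CSV text: example Id first, boolean prediction second."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Id", "spoiler"])              # header row
    for example_id, pred in enumerate(predictions):
        writer.writerow([example_id, bool(pred)])   # bool() is written as True / False
    return buffer.getvalue()

# Hypothetical predictions for three test examples
print(format_submission([1, 0, 1]))
```

The starter code achieves the same layout with a pandas `DataFrame` and `to_csv(..., index_label="Id")`.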

**Note**: You may only submit **THREE** predictions to Kaggle per day.  Instead of using the public leaderboard as your sole evaluation process, it is highly recommended that you perform local evaluation using a validation set or cross-validation. 
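
To make the idea of cross-validation concrete, here is a minimal pure-Python sketch of how k-fold splits can be generated. It is a hand-rolled illustration, not sklearn's implementation (for example, it does not stratify by label); the function name `k_fold_indices` and its parameters are hypothetical.

```python
import random

def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs so each example is validated exactly once."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)         # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]    # k roughly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

for train_idx, val_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(val_idx))
```

In practice you would train on `train_idx`, score on `val_idx`, and average the scores over the folds, which is what `cross_val_score` does for you.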

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline 

### [25 points] Problem 1: Feature Engineering 
***

The `FeatEngr` class is where the magic happens.  In its current form it will read in the training data and vectorize it using simple Bag-of-Words.  It then trains a model and makes predictions.  
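
For intuition about what Bag-of-Words vectorization produces, the sketch below builds a count matrix by hand. It assumes simple lowercase whitespace tokenization, which differs from `CountVectorizer`'s default tokenizer; the function name `bag_of_words` and the two example sentences are made up.

```python
from collections import Counter

def bag_of_words(documents):
    """Map each document to a row of word counts over a shared, sorted vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])  # 0 for absent words
    return vocabulary, rows

vocab, X = bag_of_words(["the hero dies", "the hero wins the day"])
print(vocab)  # ['day', 'dies', 'hero', 'the', 'wins']
print(X)      # [[0, 1, 1, 1, 0], [1, 0, 1, 2, 1]]
```

Each column is one feature; word order is discarded, which is one reason n-grams can add information.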

25 points of your grade will be generated from your performance on the classification competition on Kaggle. The performance will be evaluated on accuracy on the held-out test set. Half of the test set is used to evaluate accuracy on the public leaderboard.  The other half of the test set is used to evaluate accuracy on the private leaderboard (which you will not be able to see until the close of the competition). 

You should be able to significantly improve on the baseline system (i.e. the predictions made by the starter code we've provided) as reported by the Kaggle system.  Additionally, the top **THREE** students from the **PRIVATE** leaderboard at the end of the contest will receive 5 extra credit points towards their Problem 1 score.


In [4]:
from sklearn.base import BaseEstimator, TransformerMixin

class LowerAlphabetTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
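    def fit(self, examples, y=None):
        # Stateless transformer: nothing to learn, so just return self.
        # y=None keeps the signature compatible with sklearn's fit(X, y) convention.
        return self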
    
    def transform(self, examples):
        
        import numpy as np 
        from scipy.sparse import csr_matrix
        
        letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r',
                   's','t','u','v','w','x','y','z']
         
        # Initialize matrix 
        X = np.zeros((len(examples), len(letters)))
        
        # Loop over examples and count letters 
        for ii, x in enumerate(examples):
            X[ii,:] = np.array([x.count(letter) for letter in letters])
            
        return csr_matrix(X)

In [5]:
from sklearn.base import BaseEstimator, TransformerMixin

class UpperAlphabetTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
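    def fit(self, examples, y=None):
        # Stateless transformer: nothing to learn, so just return self.
        # y=None keeps the signature compatible with sklearn's fit(X, y) convention.
        return self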
    
    def transform(self, examples):
        
        import numpy as np 
        from scipy.sparse import csr_matrix
        
        letters = ['A','B','C','D','E','F','G','H','I','J',
                   'K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
         
        # Initialize matrix 
        X = np.zeros((len(examples), len(letters)))
        
        # Loop over examples and count letters 
        for ii, x in enumerate(examples):
            X[ii,:] = np.array([x.count(letter) for letter in letters])
            
        return csr_matrix(X)

In [6]:
from sklearn.base import BaseEstimator, TransformerMixin

class DigitTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
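    def fit(self, examples, y=None):
        # Stateless transformer: nothing to learn, so just return self.
        # y=None keeps the signature compatible with sklearn's fit(X, y) convention.
        return self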
    
    def transform(self, examples):
        
        import numpy as np 
        from scipy.sparse import csr_matrix
        
        numbers = ['0','1','2','3','4','5','6','7','8','9']
         
        # Initialize matrix 
        X = np.zeros((len(examples), len(numbers)))
        
        # Loop over examples and count digits 
        for ii, x in enumerate(examples):
            X[ii,:] = np.array([x.count(number) for number in numbers])
            
        return csr_matrix(X)

In [7]:
from sklearn.base import BaseEstimator, TransformerMixin

class SpecialCharTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
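    def fit(self, examples, y=None):
        # Stateless transformer: nothing to learn, so just return self.
        # y=None keeps the signature compatible with sklearn's fit(X, y) convention.
        return self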
    
    def transform(self, examples):
        
        import numpy as np 
        from scipy.sparse import csr_matrix
        
        chars = ["'", '"', '!', "...", ',', '?','~']
        
        # Initialize matrix 
        X = np.zeros((len(examples), len(chars)))
        
        # Loop over examples and count special characters 
        for ii, x in enumerate(examples):
            X[ii,:] = np.array([x.count(char) for char in chars])
            
        return csr_matrix(X)

In [8]:
from sklearn.base import BaseEstimator, TransformerMixin

class SpaceTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
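    def fit(self, examples, y=None):
        # Stateless transformer: nothing to learn, so just return self.
        # y=None keeps the signature compatible with sklearn's fit(X, y) convention.
        return self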
    
    def transform(self, examples):
        
        import numpy as np 
        from scipy.sparse import csr_matrix
        
        spaces = [' ','  ', '   ', '    ']
        
        # Initialize matrix 
        X = np.zeros((len(examples), len(spaces)))
        
        # Loop over examples and count runs of 1 to 4 spaces (non-overlapping matches) 
        for ii, x in enumerate(examples):
            X[ii,:] = np.array([x.count(space) for space in spaces])
            
        return csr_matrix(X)

In [9]:
class ItemSelector(BaseEstimator, TransformerMixin):
   
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

In [None]:
import re
import numpy as np 
from scipy.sparse import csr_matrix
from scipy.sparse import coo_matrix, hstack
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from sklearn.model_selection import train_test_split
import matplotlib.pylab as plt
from sklearn.pipeline import Pipeline
        
class FeatEngr:
    def __init__(self):
        
        
        return None
        #self.vectorizer = CountVectorizer(stop_words = 'english', ngram_range=(1,4))

    def train_test_modifier(self, examples, trope, page):
        
        self.trope1 = []
        self.page1 = []
        
        for j in trope:
            j = re.findall('[A-Z][^A-Z]*', j)
            j = " ".join(j)
            self.trope1.append(" " + j + " ")
            
        for j in page:
            #j = re.findall('[A-Z][^A-Z]*', j)
            #j = " ".join(j)
            self.page1.append(" " + j + " ")
        
        newex = dict()
        
        newex['sentence'] = examples
        newex['trope1'] = self.trope1
        newex['page1'] = self.page1
        
        return newex
    
    def build_train_features(self, examples, trope, page):
        """
        Method to take in training text features and do further feature engineering 
        Most of the work in this homework will go here, or in similar functions  
        :param examples: currently just a list of forum posts  
        """
        page1 = []
        self.page1 = []
        page2 = []
        self.trope1 = []
        trope2 = []
        
        
        
        #Combining features using FeatureUnion
        
        
        self.allmyfeatures = FeatureUnion([ ('sentence_bagOfWords', Pipeline([
                                                ('sentence_sel1', ItemSelector(key='sentence')),
                                                ("bag-of-words-sent", CountVectorizer(stop_words='english', 
                                                                            ngram_range=(-1,6), lowercase = False))])),
                                            ('sentence_tfidf', Pipeline([
                                                ('sentence_sel2', ItemSelector(key='sentence')),
                                                ("tf_idf", TfidfVectorizer(min_df = 0.001, max_df = 0.999, norm = 'l2'))])),
                                            ('sentence_loweralpha', Pipeline([
                                                ('sentence_sel3', ItemSelector(key='sentence')),
                                                ("lower-alphabets", LowerAlphabetTransformer())])),
                                            #('sentence_spacer', Pipeline([
                                            #    ('sentence_sel4', ItemSelector(key='sentence')),
                                            #    ("spaces", SpaceTransformer())])),
                                            #('sentence_specialchar', Pipeline([
                                            #    ('sentence_sel5', ItemSelector(key='sentence')),
                                            #    ("specialchar", SpecialCharTransformer())])),
                                            ('trope_OneHotEncoding', Pipeline([
                                                ('trope_sel1', ItemSelector(key='trope1')),
                                                ("bag-of-words-trope", CountVectorizer(stop_words='english',ngram_range=(0,3), lowercase = False))])),
                                            #('trope_loweralpha', Pipeline([
                                            #    ('trope_sel2', ItemSelector(key='trope1')),
                                            #    ("lower-alphabets-trope", LowerAlphabetTransformer())])),
                                            #('trope_upperalpha', Pipeline([
                                            #    ('trope_sel3', ItemSelector(key='trope1')),
                                            #    ("upper-alphabets-trope", UpperAlphabetTransformer())])),
                                            #('page_OneHotEncoding', Pipeline([
                                            #    ('page_sel1', ItemSelector(key='page1')),
                                            #    ("bag-of-words-page", CountVectorizer())]))
                                            #('page_tfidf', Pipeline([
                                            #    ('page_sel2', ItemSelector(key='page1')),
                                            #    ("tf_idf", TfidfVectorizer())])),
                                            #('page_digits', Pipeline([
                                            #    ('page_sel2', ItemSelector(key='page1')),
                                            #    ("digits-page", DigitTransformer())]))
                                          ]);
        
        newex = dict()
        
        newex = self.train_test_modifier(examples,trope, page)
        
        Z1 = self.allmyfeatures.fit_transform(newex)
        
        #print("Z1 has type ", type(Z1))
        print("The final matrix with all features has shape ", Z1.shape)
        
        return Z1

    def get_test_features(self, examples, page, trope):
        """
        Method to take in test text features and transform the same way as train features 
        :param examples: currently just a list of forum posts  
        """
        '''A1 = self.allmyfeatures.transform(examples)
        A2 = self.allmyfeatures.transform(page)
        A3 = self.allmyfeatures.transform(trope)'''
        
        #Af = np.concatenate((A1, A2, A3), axis=1)
        
        newex = dict()
        
        '''newex['sentence'] = examples
        newex['trope1'] = self.trope1'''
        
        newex = self.train_test_modifier(examples,trope, page)
        
        
        return self.allmyfeatures.transform(newex)

    def show_top10(self):
        """
        prints the top 10 features for the positive class and the 
        top 10 features for the negative class. 
        """
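        # NOTE: self.vectorizer is never assigned (the assignment in __init__ is
        # commented out), so this method raises AttributeError as written.
        feature_names = np.asarray(self.vectorizer.get_feature_names())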
        top10 = np.argsort(self.logreg.coef_[0])[-10:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:10]
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))
                
    def train_model(self, random_state=1234):
        """
        Method to read in training data from file, and 
        train Logistic Regression classifier. 
        
        :param random_state: seed for random number generator 
        """
        
        from sklearn.linear_model import LogisticRegression 
        from sklearn.model_selection import GridSearchCV
        from sklearn.model_selection import train_test_split
        from sklearn.model_selection import cross_val_score
        from sklearn.model_selection import learning_curve
        
        # load data 
        dfTrain = pd.read_csv("../data/spoilers/train.csv")
        dfTrain.head()
        
        # get training features and labels 
        self.X_train = self.build_train_features(list(dfTrain["sentence"]), list(dfTrain["trope"]), list(dfTrain["page"]))
        self.y_train = np.array(dfTrain["spoiler"], dtype=int)
        
        #Do train test split using sklearn package
        X_train, X_test, y_train, y_test = train_test_split(self.X_train, self.y_train, test_size=0.2, random_state=1230)
        
        print (X_train.shape)
        print (y_train.shape)
        print (X_test.shape)
        print (y_test.shape)
        print ("The type of X_train is " ,type(X_train))
        
        # train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
        self.logreg = LogisticRegression(random_state=random_state)
        #self.logreg.fit(self.X_train, self.y_train)
        self.logreg.fit(X_train, y_train)
        

        #Performance on train-test-split on training data
        print("Accuracy on training data = {:.3f}".format(self.logreg.score(X_train, y_train)))
        print("Accuracy on validation data = {:.3f}".format(self.logreg.score(X_test, y_test)))
        

        #Performance on cross-validation on training data
        scores = cross_val_score(self.logreg, self.X_train, self.y_train, cv=5)
        print(scores)
        print("Mean Accuracy in Cross-Validation = {:.3f}".format(scores.mean()))
        
        ylim=None
        title = "Learning Curve"
        
        
        #Plotting the learning curve for our model
        plt.figure()
        plt.title(title)
        if ylim is not None:
            plt.ylim(*ylim)
        plt.xlabel("Number of Training examples")
        plt.ylabel("Errors")
        train_sizes, train_scores, test_scores = learning_curve(
            self.logreg, self.X_train, self.y_train, cv=5, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5))
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        plt.grid()

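        # Shade +/- one standard deviation around the error curves (error = 1 - accuracy),
        # so the bands line up with the plotted lines below
        plt.fill_between(train_sizes, 1 - train_scores_mean - train_scores_std,
                         1 - train_scores_mean + train_scores_std, alpha=0.1,
                         color="r")
        plt.fill_between(train_sizes, 1 - test_scores_mean - test_scores_std,
                         1 - test_scores_mean + test_scores_std, alpha=0.1, color="g")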
        plt.plot(train_sizes,1 - train_scores_mean, 'o-', color="r",
                 label="Training error")
        plt.plot(train_sizes,1 - test_scores_mean, 'o-', color="g",
                 label="Cross-validation error")

        plt.legend(loc="best")
        plt.show()
        
    def error_analysis(self):
        
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import accuracy_score
        from sklearn.metrics import classification_report
        from sklearn.linear_model import LogisticRegression
        
        
        # Fit Model to Train Data
        limit = .5
        test_size = .2
        train = pd.read_csv("../data/spoilers/train.csv")
        # Split train data into train and validation data (also shuffles rows)
        
        train_limited = train.sample(frac=limit)
        X_train, X_val, y_train, y_val = train_test_split(train_limited, train_limited['spoiler'], test_size=test_size)
        xtr = self.build_train_features(list(X_train["sentence"]),list(X_train["trope"]), list(X_train["page"]))
        ytr = np.array(y_train, dtype=int)
        newex3 = self.train_test_modifier(list(X_val["sentence"]), list(X_val["trope"]), list(X_val["page"]))
        xval = self.allmyfeatures.transform(newex3)
        yval = np.array(y_val, dtype=int)
        self.logreg = LogisticRegression(random_state=1234)
        self.logreg.fit(xtr, ytr)
        
        pred_val = self.logreg.predict(xval)
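        # Default normalize=True returns the fraction correct; passing 0 as the third
        # argument would have returned the raw count of correct predictions instead
        print ("Validation Accuracy: ", accuracy_score(yval, pred_val))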
        
        # Error Analysis
        report = classification_report(y_val, pred_val)
        print(report)
    
    def model_predict(self):
        """
        Method to read in test data from file, make predictions
        using trained model, and dump results to file 
        """
        
        # read in test data 
        dfTest  = pd.read_csv("../data/spoilers/test.csv")
        #dfTest.head()
        
        # featurize test data 
        self.X_test = self.get_test_features(list(dfTest["sentence"]), list(dfTest["page"]), list(dfTest["trope"]))
       
        print ("The shape of X_test is ", self.X_test.shape)
        # make predictions on test data 
        pred = self.logreg.predict(self.X_test)
        
        # dump predictions to file for submission to Kaggle  
        pd.DataFrame({"spoiler": np.array(pred, dtype=bool)}).to_csv("prediction-17.csv", index=True, index_label="Id")

In [None]:
# Instantiate the FeatEngr class 
feat = FeatEngr()

# Train your Logistic Regression classifier 
feat.train_model(random_state=1230)

The final matrix with all features has shape  (11970, 472182)
(9576, 472182)
(9576,)
(2394, 472182)
(2394,)
The type of X_train is  <class 'scipy.sparse.csr.csr_matrix'>
Accuracy on training data = 0.999
Accuracy on validation data = 0.713
[ 0.67390397  0.68267223  0.64369256  0.65649812  0.65733389]
Mean Accuracy in Cross-Validation = 0.663


In [None]:
# Make prediction on test data and produce Kaggle submission file 
feat.model_predict()

In [None]:
# The file already has a header row (train_model indexes it by column name),
# so let pandas infer the column names instead of passing names=...
df = pd.read_csv("../data/spoilers/train.csv")
df.head()


### [25 points] Problem 2: Motivation and Analysis 
***

The job of the written portion of the homework is to convince the grader that:

- Your new features work
- You understand what the new features are doing
- You had a clear methodology for incorporating the new features

Make sure that you have examples and quantitative evidence that your features are working well. Be sure to explain how you used the data (e.g., did you have a validation set? did you do cross-validation?) and how you inspected the results. In addition, it is very important that you show some kind of an **error analysis** throughout your process.  That is, you should demonstrate that you've looked at misclassified examples and put thought into how you can craft new features to improve your model. 
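
One simple way to start an error analysis is to list the examples the model got wrong, split by error type. The sketch below assumes you already have parallel lists of sentences, true labels, and predicted labels; the helper name `misclassified` and the toy data are invented for illustration.

```python
def misclassified(sentences, y_true, y_pred):
    """Return (sentence, true_label, predicted_label) for every wrong prediction."""
    return [
        (text, truth, guess)
        for text, truth, guess in zip(sentences, y_true, y_pred)
        if truth != guess
    ]

# Toy data, invented for illustration only (1 = spoiler, 0 = not a spoiler)
sentences = ["He was the killer all along.",
             "The show aired in 2005.",
             "She dies in the finale."]
y_true = [1, 0, 1]
y_pred = [1, 1, 0]

for text, truth, guess in misclassified(sentences, y_true, y_pred):
    kind = "false positive" if guess == 1 else "false negative"
    print(f"{kind}: {text}")
```

Reading through the false positives and false negatives separately often suggests which kind of feature is missing.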

A sure way of getting a low grade is simply listing what you tried and reporting the Kaggle score for each. You are expected to pay more attention to what is going on with the data and take a data-driven approach to feature engineering.

**Solution to Motivation and Analysis:**<br>
- To evaluate the models, an 80-20 train-validation split was taken from the training set, and 5-fold cross-validation on the full training set was used to compute the mean cross-validation accuracy.

- Using the CountVectorizer (which builds a bag-of-words model of unigrams from the vocabulary) with the built-in list of English stop words, the accuracies on the train set and validation set were 93.9% and 66.2% respectively, while the mean accuracy in cross-validation was 59.6%.

- When n-grams with higher values of n (e.g., n = 2 and n = 3) were tried in the CountVectorizer, the accuracies rose, since bigrams and trigrams capture whether a sentence is a spoiler better than single words alone. In addition, a TF-IDF Vectorizer was used, since it gives more weight to terms that appear frequently in a particular sentence but infrequently across the rest of the dataset. Custom Transformers for Alphabets (Lowercase and Uppercase) and Digits (0-9), similar to the XYZTransformer shown in class, were also built, and the output CSR matrices from these three feature builders (i.e., CountVectorizer, TF-IDF Vectorizer, and the Alphabet and Digit Transformers) were concatenated using FeatureUnion into a single matrix. Since there was a considerable drop in accuracy for n-grams with n greater than 5, the n-grams used here were n = 1 to n = 5. The accuracies on the train set and validation set were 99.5% and 67.7% respectively, while the mean accuracy in cross-validation was 62.8%.

- It was observed that the highest accuracy from the CountVectorizer was for n = 1 to n = 4 (the higher-order n-grams hurt accuracy, since few sentences marked as spoilers shared n-grams with n = 6, 7, or 8). Also, the uppercase letters turned out to be redundant features, since an uppercase letter typically appears only at the start of a sentence or in a proper noun, either of which could occur in a spoiler or a non-spoiler. On the other hand, lowercase letter counts are more likely to be good predictors, since certain words that are common in spoilers (e.g., kills) raise the counts of particular lowercase letters. Similarly, digits were redundant features, as there wasn't a higher or lower concentration of digits in either spoilers or non-spoilers. However, it was observed that many of the spoiler sentences seemed to have more than one whitespace character between consecutive words, and hence runs of 2, 3, and 4 consecutive spaces were added as features. For this new feature set, the accuracies on the train set and validation set were 99.3% and 72.0% respectively, while the mean accuracy in cross-validation was 68.6%.

- In the next iteration, the trope field from the dataset was used. Regular expressions were used to split each trope entry into its individual words. The CountVectorizer (with unigrams) was then used to build a bag-of-words over the tropes, and this time a Pipeline was used to pass in each of the hand-built features from the previous steps, including the CountVectorizer for the tropes. The accuracies on the train set and validation set were 99.8% and 75.4% respectively, while the mean accuracy in cross-validation was 71.4%. The increase in accuracy suggests that the trope is a good predictor, which is plausible since certain TV shows are likely to have tropes that naturally involve spoilers (e.g., action/thriller or detective shows), whereas other TV shows, such as comedy and reality shows, are far less likely to.

- Finally, some of the parameters passed to the CountVectorizer were fine-tuned, and the accuracies on the train set and validation set were 99.8% and 76.6% respectively, while the mean accuracy in cross-validation was 72.0%.
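
Two of the steps described above, splitting trope names into words and extracting word n-grams, can be sketched in a few lines of plain Python. The regular expression is the same one used in `train_test_modifier`; the helper names and example strings are invented for illustration, and `word_ngrams` is a simplified stand-in for what `CountVectorizer(ngram_range=...)` does, assuming whitespace tokenization.

```python
import re

def split_camel_case(name):
    """Split a CamelCase trope name into space-separated words."""
    # An uppercase letter followed by any run of non-uppercase characters
    return " ".join(re.findall(r"[A-Z][^A-Z]*", name))

def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

print(split_camel_case("ChekhovsGun"))         # Chekhovs Gun
print(word_ngrams("he kills the king", 1, 2))
# ['he', 'kills', 'the', 'king', 'he kills', 'kills the', 'the king']
```

Note that the pattern drops any characters before the first uppercase letter and splits acronyms into single letters, which may or may not be what you want for a given trope name.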

In [None]:
#Error Analysis
feat.error_analysis()

The table above shows the classification report of our classifier, which includes the precision, recall, and F1-score for the Spoilers (True) and Non-spoilers (False) classes, along with the averages of these scores across both classes.
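
For reference, the per-class quantities in such a report can be computed by hand as in the sketch below. This is a plain-Python illustration of the standard definitions, not sklearn's code; the function name `precision_recall_f1` and the toy labels are made up.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

Calling the function with `positive=0` gives the corresponding scores for the non-spoiler class.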

### Hints 
***

- Don't use all the data until you're ready. 

- Examine the features that are being used.

- Do error analyses.

- If you have questions that aren’t answered in this list, feel free to ask them on Piazza.
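
As an illustration of the first hint, the sketch below draws a reproducible random subset of the training examples so that experiments run quickly. It uses only the standard library; the name `sample_fraction` is hypothetical, and with a pandas DataFrame the equivalent would be `DataFrame.sample(frac=...)`, as used in `error_analysis` above.

```python
import random

def sample_fraction(rows, frac=0.5, seed=0):
    """Return a random subset containing roughly `frac` of the rows."""
    k = int(len(rows) * frac)
    return random.Random(seed).sample(rows, k)   # sampling without replacement

rows = list(range(100))                  # stand-in for training examples
subset = sample_fraction(rows, frac=0.5)
print(len(subset))                       # 50
```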

### FAQs 
***

> Can I heavily modify the FeatEngr class? 

Totally.  This was just a starting point.  The only thing you cannot modify is the LogisticRegression classifier.  

> Can I look at TV Tropes?

Yes, in order to gain insight about the data. However, your feature extraction cannot use any additional data (beyond what I've given you) from the TV Tropes webpage.

> Can I use IMDB, Wikipedia, or a dictionary?

Yes, but you are not required to. So long as your features are fully automated, they can use any dataset other than TV Tropes. Be careful, however, that your dataset does not somehow include TV Tropes (e.g. using all webpages indexed by Google will likely include TV Tropes).

> Can I combine features?

Yes, and you probably should. This will likely be quite effective.
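
A minimal sketch of what combining features means: each extractor turns a text into a list of numbers, and the lists are concatenated into one feature row. This mimics, for a single example in plain Python, the column-wise concatenation that sklearn's `FeatureUnion` performs; all function names here are hypothetical.

```python
def letter_counts(text):
    """Counts of a deliberately tiny alphabet, to keep the output short."""
    return [text.count(ch) for ch in "abc"]

def length_features(text):
    """Number of characters and number of whitespace-separated words."""
    return [len(text), len(text.split())]

def combine(text, extractors):
    """Concatenate the outputs of all extractors into one feature row."""
    row = []
    for extract in extractors:
        row.extend(extract(text))
    return row

print(combine("a cab", [letter_counts, length_features]))  # [2, 1, 1, 5, 2]
```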

> Can I use Mechanical Turk?

That is not fully automatic, so no. You should be able to run your feature extraction without any human intervention. If you want to collect data from Mechanical Turk to train a classifier that you can then use to generate your features, that is fine. (But that’s way too much work for this assignment.)

> Can I use a Neural Network to automatically generate derived features? 

No. This assignment is about your ability to extract meaningful features from the data using your own experimentation and experience.

> What sort of improvement is “good” or “enough”?

If you have a 10-15% improvement over the baseline (on the Public Leaderboard) with your features, that’s more than sufficient. If you fail to get that improvement but have tried reasonable features, that satisfies the requirements of the assignment. However, the extra credit for “winning” the class competition depends on the performance of other students.

> Where do I start?  

It might be a good idea to look at the in-class notebook associated with the Feature Engineering lecture where we did similar experiments. 


> Can I use late days on this assignment? 

You can use late days for the write-up submission, but the Kaggle competition closes at **4:59pm on Friday February 23rd**.

> Why does it say that the competition ends at 11:59pm when the assignment says 4:59pm? 

The end time/date are in UTC.  11:59pm UTC is equivalent to 4:59pm MST.  Kaggle In-Class does not allow us to change this. 
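
The conversion can be checked with Python's standard `datetime` module. The sketch assumes the year 2018 (taken from the competition URL) and models MST as a fixed UTC-7 offset.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
MST = timezone(timedelta(hours=-7), name="MST")   # fixed offset, no daylight saving

close_utc = datetime(2018, 2, 23, 23, 59, tzinfo=UTC)   # deadline as shown by Kaggle
close_mst = close_utc.astimezone(MST)

print(close_mst.isoformat())   # 2018-02-23T16:59:00-07:00
```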