# Feature Engineering Homework 
***
**Name**: $<$Xu Han$>$ 

**Kaggle Username**: $<$xuhan$>$
***

This assignment is due on Moodle by **5pm on Friday February 23rd**. Additionally, you must make at least one submission to the **Kaggle** competition before it closes at **4:59pm on Friday February 23rd**. Submit only this Jupyter notebook to Moodle. Do not compress it using tar, rar, zip, etc. Your solutions to analysis questions should be done in Markdown directly below the associated question.  Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**.  For a refresher on the course **Collaboration Policy** click [here](https://github.com/chrisketelsen/CSCI5622-Machine-Learning/blob/master/resources/syllabus.md#collaboration-policy)



## Overview 
***

When people are discussing popular media, there’s a concept of spoilers. That is, critical information about the plot of a TV show, book, or movie that “ruins” the experience for people who haven’t read / seen it yet.

The goal of this assignment is to do text classification on forum posts from the website [tvtropes.org](http://tvtropes.org/), to predict whether a post is a spoiler or not. We'll be using the logistic regression classifier provided by sklearn.

Unlike previous assignments, the code provided with this assignment has all of the functionality required. Your job is to make the functionality better by improving the features the code uses for text classification.

**NOTE**: Because the goal of this assignment is feature engineering, not classification algorithms, you may not change the underlying algorithm or it's parameters

This assignment is structured in a way that approximates how classification works in the real world: Features are typically underspecified (or not specified at all). You, the data digger, have to articulate the features you need. You then compete against others to provide useful predictions.

It may seem straightforward, but do not start this at the last minute. There are often many things that go wrong in testing out features, and you'll want to make sure your features work well once you've found them.


## Kaggle In-Class Competition 
***

In addition to turning in this notebook on Moodle, you'll also need to submit your predictions on Kaggle, an online tournament site for machine learning competitions. The competition page can be found here:  

[https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018](https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018)

Additionally, a private invite link for the competition has been posted to Piazza. 

The starter code below has a `model_predict` method which produces a two column CSV file that is correctly formatted for Kaggle (predictions.csv). It should have the example Id as the first column and the prediction (`True` or `False`) as the second column. If you change this format your submissions will be scored as zero accuracy on Kaggle. 

**Note**: You may only submit **THREE** predictions to Kaggle per day.  Instead of using the public leaderboard as your sole evaluation processes, it is highly recommended that you perform local evaluation using a validation set or cross-validation. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import nltk

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
from nltk import ngrams
from nltk.corpus import stopwords
from gensim import corpora, models
import gensim
%matplotlib inline 

### [25 points] Problem 1: Feature Engineering 
***

The `FeatEngr` class is where the magic happens.  In it's current form it will read in the training data and vectorize it using simple Bag-of-Words.  It then trains a model and makes predictions.  

25 points of your grade will be generated from your performance on the the classification competition on Kaggle. The performance will be evaluated on accuracy on the held-out test set. Half of the test set is used to evaluate accuracy on the public leaderboard.  The other half of the test set is used to evaluate accuracy on the private leaderboard (which you will not be able to see until the close of the competition). 

You should be able to significantly improve on the baseline system (i.e. the predictions made by the starter code we've provided) as reported by the Kaggle system.  Additionally, the top **THREE** students from the **PRIVATE** leaderboard at the end of the contest will receive 5 extra credit points towards their Problem 1 score.


In [2]:
#calculate the length of the sentence    
class LengthTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self.feature_names = []
    
    def fit(self, examples):
        # return self and nothing else 
        return self
    
    def transform(self, examples):
        
        import numpy as np 
        from scipy.sparse import csr_matrix
                 
        # Initiaize matrix 
        X = np.zeros((len(examples), 1))
        
        # Loop over examples and count length
        for ii, x in enumerate(examples):
            X[ii, :] = np.array([len(x)])
            self.feature_names.append('length'+str(X[ii,:]))
        return csr_matrix(X)
 
    def get_feature_names(self):
        return self.feature_names

    
    
#LDA model    

class LDATransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self.feature_names = []
    
    def fit(self, examples):
        # return self and nothing else 
        return self
    
    def get_feature_names(self):
        return self.feature_names
    
    def transform(self, examples):
        
        import numpy as np 
        from scipy.sparse import csr_matrix
        
        tokenizer = RegexpTokenizer(r'\w+')
        
        # create English stop words list
        en_stop = set(stopwords.words('english'))
        
        # Create p_stemmer of class PorterStemmer
        p_stemmer = PorterStemmer()
        
        # list for tokenized documents in loop
        texts = []
        
        # Initiaize matrix 
        X = np.zeros((len(examples), 1))
        
        # Loop over examples and count letters 
        for ii, x in enumerate(examples):
            # clean and tokenize document string
            raw = x.lower()
            tokens = tokenizer.tokenize(raw)
            
            # remove stop words from tokens
            stopped_tokens = [i for i in tokens if not i in en_stop]
            
            # stem tokens
            stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
            
            # add tokens to list
            texts.append(stemmed_tokens)
            
        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(text) for text in texts]
        self.ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary)
        
        for ii, x in enumerate(examples):
            # clean and tokenize document string
            raw = x.lower()
            tokens = tokenizer.tokenize(raw)
            
            # remove stop words from tokens
            stopped_tokens = [i for i in tokens if not i in en_stop]
            
            # stem tokens
            stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
            

            bow = self.ldamodel.id2word.doc2bow(stemmed_tokens)
            doc_topics, word_topics, phi_values = self.ldamodel.get_document_topics(bow,per_word_topics=True)
            
        
            X[ii,:] = np.array(doc_topics[0][1])
            
            self.feature_names.append('lda'+str(X[ii,:]))
            #print(X[ii,:])
            
        #X = preprocessing.normalize(X, norm='l2')
        return csr_matrix(X)
    
    def topics(self,x):
        tokenizer = RegexpTokenizer(r'\w+')
        
        # create English stop words list
        en_stop = set(stopwords.words('english'))
        
        # Create p_stemmer of class PorterStemmer
        p_stemmer = PorterStemmer()
        
        # clean and tokenize document string
        raw = x.lower()
        tokens = tokenizer.tokenize(raw)
            
        # remove stop words from tokens
        stopped_tokens = [i for i in tokens if not i in en_stop]
            
        # stem tokens
        stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
            

        bow = self.ldamodel.id2word.doc2bow(stemmed_tokens)
        doc_topics, word_topics, phi_values = self.ldamodel.get_document_topics(bow,per_word_topics=True)
        print(doc_topics)
    
    def show(self):
        print(self.ldamodel.show_topics())


In [15]:
class FeatEngr:
    def __init__(self):
        
        
        
        #self.vectorizer = CountVectorizer(stop_words='english')  # "bag-of-words" Accuracy: 0.67 (+/- 0.04)
        #self.vectorizer_t = TfidfVectorizer(analyzer='word',ngram_range=(1,2), lowercase=True, norm='l2',stop_words='english')   # "bag-of-words-tfidf" Accuracy: 0.682206 (+/- 0.011113)
        
        #extract features from different columns
        self.vectorizer = FeatureUnion( 
        [       
                ('bag of words', 
                  Pipeline([('extract_field', FunctionTransformer(lambda x: x[0], validate = False)),
                            ('tfid', TfidfVectorizer(analyzer='word',ngram_range=(1,2), lowercase=True, norm='l2',stop_words='english'))])),              
                #('type of trope', 
                #  Pipeline([('extract_field', FunctionTransformer(lambda x: x[1], validate = False)),
                #            ('trope', TfidfVectorizer())])),
                ('length of sentence',
                 Pipeline([('extract_field', FunctionTransformer(lambda x:  x[0], validate = False)), 
                            ('length', LengthTransformer())])),  
                ('LDA model',
                  Pipeline([('extract_field', FunctionTransformer(lambda x:  x[0], validate = False)), 
                            ('lda', LDATransformer())])),    
        ])
    
    def build_train_features(self, examples):
        """
        Method to take in training text features and do further feature engineering 
        Most of the work in this homework will go here, or in similar functions  
        :param examples: currently just a list of forum posts  
        """
        return self.vectorizer.fit_transform(examples)

    def get_test_features(self, examples):
        """
        Method to take in test text features and transform the same way as train features 
        :param examples: currently just a list of forum posts  
        """
        return self.vectorizer.transform(examples)


                
    def train_model(self, random_state=1234):
        """
        Method to read in training data from file, and 
        train Logistic Regression classifier. 
        
        :param random_state: seed for random number generator 
        """
        
         
        
        # load data 
        dfTrain = pd.read_csv("/home/yichen/Dropbox/CSCI 5922/Alexa-testing-tool/crawlerdata/train.csv")
        # get training features and labels 
        self.X_train = self.build_train_features([list(dfTrain["response"])])
        self.y_train = np.array(dfTrain["label"], dtype=int)
        
        # train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)
        
        scores = cross_val_score(self.logreg, self.X_train, self.y_train, cv =10)
        print("train CV: %0.6f (+/- %0.6f)" % (scores.mean(), scores.std() * 2))
        
            
            
    def model_predict(self):
        """
        Method to read in test data from file, make predictions
        using trained model, and dump results to file 
        """
        
        # read in test data 
        dfTest  = pd.read_csv("/home/yichen/Dropbox/CSCI 5922/Alexa-testing-tool/crawlerdata/test.csv")
        
        # featurize test data 
        self.X_test = self.get_test_features([list(dfTest["response"])])
        
        # make predictions on test data 
        pred = self.logreg.predict(self.X_test)
        
        # dump predictions to file for submission to Kaggle  
        self.y_test = dfTest['label'].tolist()
        from sklearn.metrics import accuracy_score
        from sklearn.metrics import confusion_matrix
        print(accuracy_score(pred,self.y_test))
        
        print("confusion matrix:")
        print(confusion_matrix(pred,self.y_test))

In [16]:
# Instantiate the FeatEngr clas 
feat = FeatEngr()

# Train your Logistic Regression classifier 
feat.train_model(random_state=1230)

# Make prediction on test data and produce Kaggle submission file 
feat.model_predict()

train CV: 0.615071 (+/- 0.038506)
0.6304849884526559
confusion matrix:
[[ 50   5   0   0   5   0   1   6   0   0   2   0   0   0   0]
 [  3  55   1   1   5   3   5   5   1   1   0   0   1   2   0]
 [  0   0  15   0   0   0   0   0   0   0   0   0   0   0   1]
 [  6  14   5 173  10  10   9  22   2   7   8   2   6   6   6]
 [  2   1   4   3  56   1   4   3   2   0   1   1   5   0   3]
 [  0   0   0   0   1  22   1   0   0   0   0   0   0   0   0]
 [  1   1   0   0   0   1  14   1   0   0   4   0   0   0   0]
 [  8  11   4  17  10  12  11  84   0   7   4   3   2   1   2]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  2   1   0   0   0   1   0   0   0   4   0   0   0   0   0]
 [  1   0   0   3   0   0   0   0   0   5  11   0   1   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   1   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   1   0   0   0   0   0   0  12   0]
 [  2   1   0   1   1   0   3   1   0   0   1  