# Feature Engineering Homework 
***
**Name**: $<$Xiaolan Cai$>$ 

**Kaggle Username**: $<$xiaolan cai$>$
***

This assignment is due on Moodle by **5pm on Friday February 23rd**. Additionally, you must make at least one submission to the **Kaggle** competition before it closes at **4:59pm on Friday February 23rd**. Submit only this Jupyter notebook to Moodle. Do not compress it using tar, rar, zip, etc. Your solutions to analysis questions should be done in Markdown directly below the associated question. Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**. For a refresher on the course **Collaboration Policy** click [here](https://github.com/chrisketelsen/CSCI5622-Machine-Learning/blob/master/resources/syllabus.md#collaboration-policy).



## Overview 
***

When people are discussing popular media, there’s a concept of spoilers. That is, critical information about the plot of a TV show, book, or movie that “ruins” the experience for people who haven’t read / seen it yet.

The goal of this assignment is to do text classification on forum posts from the website [tvtropes.org](http://tvtropes.org/), to predict whether a post is a spoiler or not. We'll be using the logistic regression classifier provided by sklearn.

Unlike previous assignments, the code provided with this assignment has all of the functionality required. Your job is to make the functionality better by improving the features the code uses for text classification.

**NOTE**: Because the goal of this assignment is feature engineering, not classification algorithms, you may not change the underlying algorithm or its parameters.

This assignment is structured in a way that approximates how classification works in the real world: Features are typically underspecified (or not specified at all). You, the data digger, have to articulate the features you need. You then compete against others to provide useful predictions.

It may seem straightforward, but do not start this at the last minute. There are often many things that go wrong in testing out features, and you'll want to make sure your features work well once you've found them.


## Kaggle In-Class Competition 
***

In addition to turning in this notebook on Moodle, you'll also need to submit your predictions on Kaggle, an online tournament site for machine learning competitions. The competition page can be found here:  

[https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018](https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018)

Additionally, a private invite link for the competition has been posted to Piazza. 

The starter code below has a `model_predict` method which produces a two column CSV file that is correctly formatted for Kaggle (predictions.csv). It should have the example Id as the first column and the prediction (`True` or `False`) as the second column. If you change this format your submissions will be scored as zero accuracy on Kaggle. 

**Note**: You may only submit **THREE** predictions to Kaggle per day. Instead of using the public leaderboard as your sole evaluation process, it is highly recommended that you perform local evaluation using a validation set or cross-validation.
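
As a rough illustration of the local evaluation recommended above, the sketch below runs stratified cross-validation on a small made-up dataset. The sentences and labels are hypothetical stand-ins rather than competition data, and wrapping the vectorizer and classifier in one pipeline is one possible design, not a requirement of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, NOT taken from the competition files.
sentences = [
    "he dies in the finale", "she is killed at the end",
    "it turns out he is the killer", "the hero dies saving her",
    "the villain is revealed in the end", "she turns out to be dead",
    "the cast is funny", "this show aired on tv",
    "the actor often improvises", "the series has many seasons",
    "the writers like running gags", "the character usually wears a hat",
]
labels = [1] * 6 + [0] * 6  # 1 = spoiler, 0 = not a spoiler

# Putting the vectorizer inside the pipeline means the vocabulary is
# re-fit on the training part of every fold.
model = make_pipeline(CountVectorizer(), LogisticRegression())

# For a classifier and an integer cv, sklearn uses stratified folds.
scores = cross_val_score(model, sentences, labels, cv=3)
print("Accuracy: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

Because each fold only sees its own training sentences when building the vocabulary, the reported score is usually a slightly more conservative estimate than one computed on features that were fit on the full training set.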

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split

from numpy import array
import random
import nltk
#nltk.download('punkt')
import re

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
from nltk.util import ngrams
#nltk.download('averaged_perceptron_tagger')

from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

%matplotlib inline 

### [25 points] Problem 1: Feature Engineering 
***

The `FeatEngr` class is where the magic happens. In its current form it will read in the training data and vectorize it using simple Bag-of-Words. It then trains a model and makes predictions.

25 points of your grade will be generated from your performance on the classification competition on Kaggle. The performance will be evaluated on accuracy on the held-out test set. Half of the test set is used to evaluate accuracy on the public leaderboard. The other half of the test set is used to evaluate accuracy on the private leaderboard (which you will not be able to see until the close of the competition).

You should be able to significantly improve on the baseline system (i.e. the predictions made by the starter code we've provided) as reported by the Kaggle system.  Additionally, the top **THREE** students from the **PRIVATE** leaderboard at the end of the contest will receive 5 extra credit points towards their Problem 1 score.


In [15]:
class FeatEngr:
    def __init__(self):

        #baseline feature
        #self.vectorizer = CountVectorizer()

        estimators = [('bag-of-words',Pipeline([
                                                ('extract-field', FunctionTransformer(lambda x: x[0], validate = False)),
                                                ('tfidf', TfidfVectorizer(analyzer = "word",ngram_range = (1,2),max_df=0.5))
                                                ])),

                        ('type-of-trope', Pipeline([
                                                 ('extract-field', FunctionTransformer(lambda x: x[1], validate = False)),
                                                 ('tf', TfidfVectorizer())
                                                ])),

                       
                        #('name-of-page', Pipeline([
                         #                       ('extract_field', FunctionTransformer(lambda x: x[2], validate = False)), 
                          #                      ('page', TfidfVectorizer())
                           #                     ])),
                        
                        #('baseline', Pipeline([
                         #                     ('extract-field', FunctionTransformer(lambda x: x[0], validate = False)),
                          #                    ('countvec',CountVectorizer())
                           #                   ]))
                
                                                ]
                        
        featureunion = FeatureUnion(estimators)
       
        self.vectorizer = featureunion
        #self.vectorizer =  TfidfVectorizer( ngram_range = [1,2], max_df=0.5,stop_words = "english")
        #self.vectorizer =  TfidfVectorizer()
        #self.vectorizer = PageTransformer()

    def build_train_features(self, examples):
        """
        Method to take in training text features and do further feature engineering
        Most of the work in this homework will go here, or in similar functions
        :param examples: currently just a list of forum posts
        """
        return self.vectorizer.fit_transform(examples)

    def get_test_features(self, examples):
        """
        Method to take in test text features and transform the same way as train features
        :param examples: currently just a list of forum posts
        """
        return self.vectorizer.transform(examples)

    def show_top10(self, n=30):
        """
        prints the top n features for the positive class and the
        top n features for the negative class (n defaults to 30).
        """
        # NOTE: get_feature_names() was removed in scikit-learn 1.2;
        # newer releases provide get_feature_names_out() instead.
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        order = np.argsort(self.logreg.coef_[0])
        top = order[-n:]
        bottom = order[:n]
        print("Pos: %s" % ", ".join(feature_names[top]))
        print("Neg: %s" % ", ".join(feature_names[bottom]))

    def train_model(self, random_state=1234):
        """
        Method to read in training data from file, and
        train Logistic Regression classifier.

        :param random_state: seed for random number generator
        """       

        # load data
        dfTrain = pd.read_csv("../data/spoilers/train.csv")
        

        #imdb = pd.DataFrame.from_csv("../data/spoilers/title.basics.tsv",sep='\t')
        #eqguide = pd.read_csv("../data/spoilers/allshows.csv")

        shape = dfTrain.shape
        #print(list(dfTrain["trope"]))
        # get training features and labels
        #print(list(dfTrain["spoiler"]).count(False))

        post = []
        print()
        #for sentence in list(dfTrain["sentence"]):
        text = word_tokenize(dfTrain["sentence"][0])

        text = pos_tag(text)
        print(text[0][1])        

        """
        ####error analysis###
        mytrain, mytest = train_test_split(dfTrain, test_size=0.2, shuffle=False, random_state=1230)
        self.X_train = self.build_train_features([list(mytrain["sentence"]),list(mytrain["trope"]),list(mytrain["page"])])
        #self.X_train = self.build_train_features(list(mytrain["sentence"]))

        self.y_train = np.array(mytrain["spoiler"], dtype=int)

        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)
        self.X_test = self.get_test_features([list(mytest["sentence"]),list(mytest["trope"]),list(mytest["page"])])
        #self.X_test = self.get_test_features(list(mytest["sentence"]))

        pred = self.logreg.predict(self.X_test)

        misclassified = pd.Series(np.array(mytest['spoiler']) != pred)
        print(mytest[misclassified.values][['spoiler', 'sentence','trope']])
        #np.savetxt(r'../data/spoilers/err.txt', mytest[misclassified.values][['spoiler', 'sentence']], fmt='%d')
        ###error analysis###

        """
        ###train the model
        #self.X_train = self.build_train_features(list(dfTrain["sentence"]))
        self.X_train = self.build_train_features([list(dfTrain["sentence"]),list(dfTrain["trope"]),list(dfTrain["page"]), post])
        self.y_train = np.array(dfTrain["spoiler"], dtype=int)

       
        # train logistic regression model.  !!You MAY NOT CHANGE THIS!!
        self.logreg = LogisticRegression(random_state=random_state)
        self.logreg.fit(self.X_train, self.y_train)

        
        #do 5-fold cross validation
        scores = cross_val_score(self.logreg, self.X_train, self.y_train, cv=5)
        print("Accuracy: %0.5f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

        y_pred = cross_val_predict(self.logreg,self.X_train,self.y_train,cv=5)
        conf_mat = confusion_matrix(self.y_train,y_pred)
        print(conf_mat)
        


    def model_predict(self):
        """
        Method to read in test data from file, make predictions
        using trained model, and dump results to file
        """

        # read in test data
        dfTest = pd.read_csv("../data/spoilers/test.csv")


        # featurize test data
        self.X_test = self.get_test_features([list(dfTest["sentence"]),list(dfTest["trope"]),list(dfTest["page"])])

        # make predictions on test data
        pred = self.logreg.predict(self.X_test)

        # dump predictions to file for submission to Kaggle
        pd.DataFrame({"spoiler": np.array(pred, dtype=bool)}).to_csv("prediction.csv", index=True, index_label="Id")
        
        

In [16]:
# Instantiate the FeatEngr class
feat = FeatEngr()

# Train your Logistic Regression classifier 
feat.train_model(random_state=1230)

# Shows the top 10 features for each class 
#feat.show_top10()

# Make prediction on test data and produce Kaggle submission file 
feat.model_predict()


IN
Accuracy: 0.67986 (+/- 0.02)
[[4109 1573]
 [2259 4029]]


### [25 points] Problem 2: Motivation and Analysis 
***

The job of the written portion of the homework is to convince the grader that:

- Your new features work
- You understand what the new features are doing
- You had a clear methodology for incorporating the new features

Make sure that you have examples and quantitative evidence that your features are working well. Be sure to explain how you used the data (e.g., did you have a validation set? did you do cross-validation?) and how you inspected the results. In addition, it is very important that you show some kind of an **error analysis** throughout your process.  That is, you should demonstrate that you've looked at misclassified examples and put thought into how you can craft new features to improve your model. 

A sure way of getting a low grade is simply listing what you tried and reporting the Kaggle score for each. You are expected to pay more attention to what is going on with the data and take a data-driven approach to feature engineering.
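
One way to carry out the error analysis described above is to hold out part of the training data and print the rows the model gets wrong. The sketch below assumes a frame with the same column names as `train.csv` (`sentence`, `spoiler`, `trope`), but the rows themselves are hypothetical toy examples.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy rows; the real data would come from train.csv.
df = pd.DataFrame({
    "sentence": [
        "he dies in the finale", "she is killed at the end",
        "it turns out he is the killer", "the hero dies saving her",
        "the villain is revealed in the end", "she turns out to be dead",
        "the cast is funny", "this show aired on tv",
        "the actor often improvises", "the series has many seasons",
        "the writers like running gags", "the character usually wears a hat",
    ],
    "spoiler": [True] * 6 + [False] * 6,
    "trope": ["AnyoneCanDie", "AnyoneCanDie", "TheReveal", "HeroicSacrifice",
              "TheReveal", "TheReveal", "Sitcom", "TheBBC", "NiceHat",
              "Sitcom", "RunningGag", "NiceHat"],
})

# Hold out 25% of the rows as a validation set, keeping class balance.
train_df, val_df = train_test_split(
    df, test_size=0.25, random_state=0, stratify=df["spoiler"]
)

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_df["sentence"])
X_val = vec.transform(val_df["sentence"])

y_train = train_df["spoiler"].astype(int).to_numpy()
y_val = val_df["spoiler"].astype(int).to_numpy()

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_val)

# Rows of the validation set where the prediction disagrees with the label.
wrong = val_df.loc[pred != y_val, ["spoiler", "sentence", "trope"]]
print(wrong)
```

Reading the misclassified rows next to their tropes is what motivates the next feature to try, rather than the accuracy number alone.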

### Result analysis

|| Feature | Accuracy with 5-fold CV |
|---|---|---|
|0| Baseline feature | 0.61186 (+/- 0.03) |
|1| Type of trope with TF-IDF | 0.62882 (+/- 0.03) |
|2| Bag-of-words of sentence with TF-IDF | 0.65246 (+/- 0.03) |
|3| Feature union of 1 and 2 | 0.67986 (+/- 0.02) |

#### 0. baseline feature

The baseline uses a bag-of-words representation of each sentence. The top features show that the terms contributing most to the classification tend to be character names, simply because they are mentioned frequently.

##### best features:
Pos: tear freya dies harvey sebastian regina morgana olivia moriarty destiny

Neg: cory johnny tim drew often hilarious meant cody disney fed
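
Lists like the ones above can be read off the fitted model by sorting the logistic regression coefficients. The following is a minimal sketch on hypothetical toy sentences; it rebuilds the feature names from `vocabulary_` so that it does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two spoilers, two non-spoilers.
sentences = [
    "he dies in the end",
    "she dies in the finale",
    "the cast is hilarious",
    "the cast is often funny",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(sentences)
clf = LogisticRegression().fit(X, labels)

# Feature names ordered by column index.
names = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

# Ascending order of coefficients: most negative first, most positive last.
order = np.argsort(clf.coef_[0])
top_pos = names[order[-3:]]
top_neg = names[order[:3]]
print("Pos: %s" % ", ".join(top_pos))
print("Neg: %s" % ", ".join(top_neg))
```

Words that occur only in spoilers (here "dies") end up with the largest positive weights, and words that occur only in non-spoilers (here "cast") with the most negative ones, which is why frequent names dominate the real lists.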

##### error analysis example:

||sentence|spoiler|Page|trope|
|---|---|---|---|---|
|1.|Each of the major houses could be considered one, what with their long and complex histories, tangled branches, sigils and mottoes, and similar looks,  which become a plot point concerning Joffrey's parentage .|TRUE|GameOfThrones|TheClan|
|2.|Something in the way that John Mahoney breaks into a snort of laughter and the way that Jane comes flouncing back and then on the correct route to Daphne's room.|FALSE|Frasier|FailedAttemptAtDrama|

A likely reason these two sentences are misclassified is that they contain people's names. For example, the second sentence contains "John", which appears more often in true spoilers, so the sentence is classified as True even though it is not a spoiler.

#### 1. type of trope with TF-IDF
The first feature I added encodes the trope, treating it as a semantic category of the line, to capture whether a trope appears more often in spoilers or in non-spoilers. I use a TF-IDF vectorizer to convert the trope text to a matrix of TF-IDF features; since each trope name is a single token, this is effectively a one-hot encoding. The accuracy with 5-fold cross-validation is 0.62882 (+/- 0.03), a slight improvement over the baseline.
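
To see what the trope feature looks like numerically, here is a small sketch on a few hypothetical trope values. It assumes that every trope is a single CamelCase word, so the default tokenizer keeps it as one lowercased token.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample of trope values; each one is a single token.
tropes = ["TheReveal", "NiceHat", "TheReveal", "HeroicSacrifice"]

vec = TfidfVectorizer()
X = vec.fit_transform(tropes).toarray()

print(sorted(vec.vocabulary_))
print(X)

# With one token per row and the default L2 normalization, every row
# has exactly one non-zero entry and that entry is 1.0.
nonzero_per_row = (X > 0).sum(axis=1)
```

Under that single-token assumption the IDF weights are normalized away, so the matrix is the same as a one-hot encoding of the trope.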

##### best features

Pos: foreshadowing, nicejobbreakingithero, heroicsacrifice, driventosuicide, thereveal, bittersweetending, themole, xanatosgambit, anyonecandie, ohcrap, whamepisode

Neg: sitcom, shirtlessscene, screwedbythenetwork, thebbc, doubleentendre, seriousbusiness, nicehat, rippedfromtheheadlines, malaproper, fakenationality, jerkjock, metaguy, stayinthekitchen

##### error analysis example:

||sentence|spoiler|Page|trope|
|---|---|---|---|---|
|1.|Detective Ash Who turns out to be  one of the 113 .|TRUE|Brimstone|ColonelMakepeace|
|2.|One of the best was in  Waldorf Salad , wherein he tries to charm the attractive lady at the desk while pointing out the obnoxious American tourist as typical of the "rubbish" they usually get.|FALSE|FawltyTowers|BlatantLies|

In the first sentence, the verb phrase "turns out" is a good clue for a spoiler. But the sentence was predicted as False because the trope "ColonelMakepeace" appears only once in the test set and never in the training set. In the second sentence, the trope "BlatantLies" appears several times in the data, more often with true spoilers, which is why the sentence was predicted as True.

Based on this reasoning, I think analyzing the words in the sentences can give better features.
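
The unseen-trope problem can be reproduced in a few lines. In this sketch the training tropes are a hypothetical sample; the point is only that a trope missing from the fitted vocabulary produces an all-zero feature row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training tropes; "ColonelMakepeace" is deliberately absent.
train_tropes = ["TheReveal", "BlatantLies", "NiceHat"]
vec = TfidfVectorizer().fit(train_tropes)

# A trope that never occurred during fitting maps to no column at all.
unseen = vec.transform(["ColonelMakepeace"])
print(unseen.shape)  # one row, one column per known trope
print(unseen.nnz)    # number of non-zero entries in that row
```

With an all-zero row the trope feature contributes nothing, so the prediction falls back on the intercept, which is why word-level features are needed for such examples.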



#### 2. bag-of-words of sentence with TF-IDF

I use a TF-IDF vectorizer with n-gram range 1-2, stop words = "english", and max_df = 0.5. This feature uses single-word tokens from the original sentence as well as consecutive pairs of tokens, so it can capture contextual information that is missed by the unigram "bag of words" model.
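
As a small illustration of what the bigram setting adds, the sketch below fits a vectorizer on a single sentence and lists the extracted features. It mirrors the `ngram_range=(1, 2)` setting of the `TfidfVectorizer` in `FeatEngr`; stop-word filtering and `max_df` are left at their defaults here, which is a simplification for the toy input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentence = "Detective Ash Who turns out to be one of the 113 ."

# Unigrams only: every word is an isolated feature.
uni = TfidfVectorizer(analyzer="word", ngram_range=(1, 1)).fit([sentence])
uni_features = sorted(uni.vocabulary_)

# Unigrams plus bigrams: consecutive word pairs become features too.
bi = TfidfVectorizer(analyzer="word", ngram_range=(1, 2)).fit([sentence])
bi_features = sorted(bi.vocabulary_)

print("turns out" in uni_features)
print("turns out" in bi_features)
print(len(uni_features), len(bi_features))
```

The pair "turns out" only exists as a feature in the second vectorizer, so only that model can learn a weight for the phrase rather than for "turns" and "out" separately.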

##### example:

||sentence|spoiler|Page|trope|
|---|---|---|---|---|
|1.|Detective Ash Who turns out to be  one of the 113 .|TRUE|Brimstone|ColonelMakepeace|

With the trope feature alone (feature 1), this sentence was misclassified as False. With the TF-IDF vectorizer and bigrams, the verb phrase "turns out" is extracted as a feature, so the sentence can be detected as a spoiler.

##### best features:
Pos: regina, love, save, michael, killing, shot, ending, lex, destiny, morgana, season finale, moriarty, ends, season, died, kill, olivia, peter, sherlock, die, dies, actually, dead, kills, finale, revealed, end, turns out, death, killed

Neg: usually, cory, like, cast, tv, tim, meant, seasons, character, examples, frasier, television, episodes, frequently, times, word, ross, role, provides, beat, buffy, version, remember, chandler, actor, series, writers, characters, lois, later seasons

From the features above, we can see that both unigrams and bigrams provide useful features. These include highly transitive verbs such as "kill", "killing", and "killed", and temporal expressions such as "end" and "ending".

##### error analysis example:

||sentence|spoiler|Page|trope|
|---|---|---|---|---|
|1.|In Canada, this episode was allowed to air, but the part where Lance Prevert tries to give his adopted kid back to the agency, only to learn that "adoption is for life," had Lance's line "Damn bureaucrat!|FALSE|YouCantDoThatOnTelevision|MissingEpisode|

In this example, the sentence is actually a comment about how the episode was aired rather than a plot detail, but it is misclassified as a spoiler. One way to handle such cases is to combine the "type of trope" feature with the n-gram TF-IDF feature, using a feature union to model the data.

#### 3. feature union

I use sklearn's `Pipeline` and `FeatureUnion` to combine the two features that improved accuracy over the baseline. This gives a better accuracy of 0.67986 (+/- 0.02).
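
The union of the two features can be sketched as follows. This is a simplified variant of the idea, not the exact code in `FeatEngr`: it assumes the input is a pandas DataFrame and uses small named helper functions (`get_sentence`, `get_trope`, both hypothetical) to pick a column, and the rows are toy examples.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def get_sentence(frame):
    """Hypothetical helper: select the sentence column."""
    return frame["sentence"]


def get_trope(frame):
    """Hypothetical helper: select the trope column."""
    return frame["trope"]


# Hypothetical toy rows with the same column names as the training data.
df = pd.DataFrame({
    "sentence": [
        "He turns out to be the killer",
        "The cast is often hilarious",
        "She dies in the season finale",
    ],
    "trope": ["TheReveal", "Sitcom", "AnyoneCanDie"],
})

union = FeatureUnion([
    ("bag-of-words", Pipeline([
        ("extract-field", FunctionTransformer(get_sentence, validate=False)),
        ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    ("type-of-trope", Pipeline([
        ("extract-field", FunctionTransformer(get_trope, validate=False)),
        ("tfidf", TfidfVectorizer()),
    ])),
])

# Each branch produces its own sparse matrix; the union stacks them
# side by side, one row per example.
X = union.fit_transform(df)
print(X.shape)
```

The resulting matrix has one block of columns per branch, so the logistic regression can weigh word n-grams and tropes jointly while each vectorizer keeps its own vocabulary.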

##### error analysis example:

||sentence|spoiler|Page|trope|
|---|---|---|---|---|
|1.|Xena killed  most of the Olympian Gods, a couple non-Olympian Gods, Mephistopheles the King of Hell, the demon Yodoshi, a super-powered Alti, the archangel Michael and stopped the ultimate evil Dahak numerous times .|TRUE|XenaWarriorPrincess|DidYouJustPunchOutCthulhu|

The sentence above contains the transitive verb "killed", yet it is still misclassified as a non-spoiler. I think the reason is that the n-gram and trope features are still not good enough. The "page" column is not used in my final features because I found that it did not improve the overall accuracy. However, "page" could be matched against metadata such as TV genre, country, air year, and length from websites such as IMDB or Epguide.com. In future work, these extra features could be added to the dataset.
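
A rough sketch of that future-work idea is shown below. The metadata table, its `genre` column, and the genre values are all hypothetical placeholders; no such file was used in this notebook, and a real lookup would need title matching against an external source.

```python
import pandas as pd

# Posts with the "page" column from the training data (toy rows).
posts = pd.DataFrame({
    "sentence": ["sentence a", "sentence b", "sentence c"],
    "page": ["GameOfThrones", "Frasier", "Brimstone"],
})

# Hypothetical metadata keyed by page name; values are illustrative only.
meta = pd.DataFrame({
    "page": ["GameOfThrones", "Frasier"],
    "genre": ["fantasy", "sitcom"],
})

# A left join keeps every post, even when no metadata is found.
merged = posts.merge(meta, on="page", how="left")
merged["genre"] = merged["genre"].fillna("unknown")

# One indicator column per genre, ready to be combined with text features.
genre_features = pd.get_dummies(merged["genre"])
print(merged[["page", "genre"]])
print(genre_features.columns.tolist())
```

Filling missing matches with an explicit "unknown" category keeps the row count unchanged, so the new columns stay aligned with the existing features.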







         

### Hints 
***

- Don't use all the data until you're ready. 

- Examine the features that are being used.

- Do error analyses.

- If you have questions that aren’t answered in this list, feel free to ask them on Piazza.

### FAQs 
***

> Can I heavily modify the FeatEngr class? 

Totally.  This was just a starting point.  The only thing you cannot modify is the LogisticRegression classifier.  

> Can I look at TV Tropes?

Yes, in order to gain insight about the data. However, your feature extraction cannot use any additional data (beyond what I've given you) from the TV Tropes webpage.

> Can I use IMDB, Wikipedia, or a dictionary?

Yes, but you are not required to. So long as your features are fully automated, they can use any dataset other than TV Tropes. Be careful, however, that your dataset does not somehow include TV Tropes (e.g. using all webpages indexed by Google will likely include TV Tropes).

> Can I combine features?

Yes, and you probably should. This will likely be quite effective.

> Can I use Mechanical Turk?

That is not fully automatic, so no. You should be able to run your feature extraction without any human intervention. If you want to collect data from Mechanical Turk to train a classifier that you can then use to generate your features, that is fine. (But that’s way too much work for this assignment.)

> Can I use a Neural Network to automatically generate derived features? 

No. This assignment is about your ability to extract meaningful features from the data using your own experimentation and experience.

> What sort of improvement is “good” or “enough”?

If you have 10-15% improvement over the baseline (on the Public Leaderboard) with your features, that’s more than sufficient. If you fail to get that improvement but have tried reasonable features, that satisfies the requirements of the assignment. However, the extra credit for “winning” the class competition depends on the performance of other students.

> Where do I start?  

It might be a good idea to look at the in-class notebook associated with the Feature Engineering lecture where we did similar experiments. 


> Can I use late days on this assignment? 

You can use late days for the write-up submission, but the Kaggle competition closes at **4:59pm on Friday February 23rd**.

> Why does it say that the competition ends at 11:59pm when the assignment says 4:59pm? 

The end time/date are in UTC.  11:59pm UTC is equivalent to 4:59pm MST.  Kaggle In-Class does not allow us to change this. 
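
The conversion can be checked with a few lines of standard-library Python. The sketch assumes a fixed UTC-7 offset for MST, which holds in February because daylight saving time is not in effect.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset time zone for Mountain Standard Time (UTC-7).
MST = timezone(timedelta(hours=-7), "MST")

# Competition close as displayed by Kaggle, in UTC.
close_utc = datetime(2018, 2, 23, 23, 59, tzinfo=timezone.utc)

# Same instant expressed in MST.
close_mst = close_utc.astimezone(MST)
print(close_mst.strftime("%Y-%m-%d %I:%M%p %Z"))
```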