# Feature Engineering Homework 
***
**Name**: Akshit Arora

**Kaggle Username**: akshitarora
***

This assignment is due on Moodle by **5pm on Friday February 23rd**. Additionally, you must make at least one submission to the **Kaggle** competition before it closes at **4:59pm on Friday February 23rd**. Submit only this Jupyter notebook to Moodle. Do not compress it using tar, rar, zip, etc. Your solutions to analysis questions should be done in Markdown directly below the associated question.  Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**.  For a refresher on the course **Collaboration Policy** click [here](https://github.com/chrisketelsen/CSCI5622-Machine-Learning/blob/master/resources/syllabus.md#collaboration-policy)



## Overview 
***

When people are discussing popular media, there’s a concept of spoilers. That is, critical information about the plot of a TV show, book, or movie that “ruins” the experience for people who haven’t read / seen it yet.

The goal of this assignment is to do text classification on forum posts from the website [tvtropes.org](http://tvtropes.org/), to predict whether a post is a spoiler or not. We'll be using the logistic regression classifier provided by sklearn.

Unlike previous assignments, the code provided with this assignment has all of the functionality required. Your job is to make the functionality better by improving the features the code uses for text classification.

**NOTE**: Because the goal of this assignment is feature engineering, not classification algorithms, you may not change the underlying algorithm or it's parameters

This assignment is structured in a way that approximates how classification works in the real world: Features are typically underspecified (or not specified at all). You, the data digger, have to articulate the features you need. You then compete against others to provide useful predictions.

It may seem straightforward, but do not start this at the last minute. There are often many things that go wrong in testing out features, and you'll want to make sure your features work well once you've found them.


## Kaggle In-Class Competition 
***

In addition to turning in this notebook on Moodle, you'll also need to submit your predictions on Kaggle, an online tournament site for machine learning competitions. The competition page can be found here:  

[https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018](https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018)

Additionally, a private invite link for the competition has been posted to Piazza. 

The starter code below has a `model_predict` method which produces a two column CSV file that is correctly formatted for Kaggle (predictions.csv). It should have the example Id as the first column and the prediction (`True` or `False`) as the second column. If you change this format your submissions will be scored as zero accuracy on Kaggle. 

**Note**: You may only submit **THREE** predictions to Kaggle per day.  Instead of using the public leaderboard as your sole evaluation processes, it is highly recommended that you perform local evaluation using a validation set or cross-validation. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline 

### [25 points] Problem 1: Feature Engineering 
***

The `FeatEngr` class is where the magic happens.  In it's current form it will read in the training data and vectorize it using simple Bag-of-Words.  It then trains a model and makes predictions.  

25 points of your grade will be generated from your performance on the the classification competition on Kaggle. The performance will be evaluated on accuracy on the held-out test set. Half of the test set is used to evaluate accuracy on the public leaderboard.  The other half of the test set is used to evaluate accuracy on the private leaderboard (which you will not be able to see until the close of the competition). 

You should be able to significantly improve on the baseline system (i.e. the predictions made by the starter code we've provided) as reported by the Kaggle system.  Additionally, the top **THREE** students from the **PRIVATE** leaderboard at the end of the contest will receive 5 extra credit points towards their Problem 1 score.


In [2]:
import nltk
import sklearn
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from string import punctuation
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

#defining stemmer and stopwords list
ps = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()
def stemmed_words(doc):
    temp = (ps.stem(w) for w in analyzer(doc))
    temp2 = [word for word in temp if word not in punctuation]
    return temp2
stopwords_nltk_en = set(stopwords.words('english'))
stopwords_punct = set(punctuation)
stoplist_combined = set.union(stopwords_nltk_en, stopwords_punct)

Querying the imdb dataset for useful features (like genre), make train2.csv and test3.csv files with an additional column containing string for all genres.

In [9]:
dfimdb = pd.read_csv('title.basics.tsv', sep='\t', header=0) #source: http://www.imdb.com/interfaces/
dfTrain2 = pd.read_csv("train2.tsv",sep='\t') #preprocessed from comma separated to tab separated file
dfTest3 = pd.read_csv("test3.tsv",sep='\t')

  interactivity=interactivity, compiler=compiler, result=result)


In [4]:
# 'GameOfThrones' -> 'Game Of Thrones'
from re import finditer
def camel_case_split(identifier):
    matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]

In [None]:
#preprocessing and adding genre column to Training set
gen_dict = {}
import math
for index,row in dfTrain2.iterrows():
    temp_name = ' '.join(camel_case_split(row["page"]))
    if pd.isnull(row["genre"]):
        if temp_name in gen_dict:
            dfTrain2.genre.iloc[index] = gen_dict[temp_name]
        else:
            temp_genre = "unknown"
            if temp_name in dfimdb.values:
                list_of_idx = dfimdb.index[dfimdb['primaryTitle'] == temp_name].tolist()
                temp_genre_list = list([ str(dfimdb.iloc[i]['genres']) for i in list_of_idx])
                temp_genre = ','.join(temp_genre_list)
            
            gen_dict[temp_name] = temp_genre
            dfTrain2.genre.iloc[index] = gen_dict[temp_name]

In [None]:
#getting out repeated genres
import math
for index,row in dfTrain2.iterrows():
    if(row["genre"] != "unknown"):
        if(type(row["genre"]) == float):
            dfTrain2.genre.iloc[index] = "unknown"
        else:
            dfTrain2.genre.iloc[index] = ' '.join(set(row["genre"].split(',')))
    print(dfTrain2.genre.iloc[index])

In [11]:
#saving dfTrain to tab separated format!
dfTrain2.to_csv('train2.tsv',sep='\t',index=False)
dfTrain2.to_csv('train2_2.tsv',sep='\t',index=False) #backup

In [None]:
#making similar genre column for test set
gen_dict2 = {}
import math
for index,row in dfTest3.iterrows():
    temp_name = ' '.join(camel_case_split(row["page"]))
    if pd.isnull(row["genre"]):
        if temp_name in gen_dict2:
            dfTest3.genre.iloc[index] = gen_dict2[temp_name]
        else:
            temp_genre = "unknown"
            if temp_name in dfimdb.values:
                list_of_idx = dfimdb.index[dfimdb['primaryTitle'] == temp_name].tolist()
                temp_genre_list = list([ str(dfimdb.iloc[i]['genres']) for i in list_of_idx])
                temp_genre = ','.join(temp_genre_list)
                if(type(temp_genre) == float):
                    temp_genre = "unknown"
            if(temp_genre != "unknown"):
                if(type(temp_genre) == float):
                    temp_genre = "unknown"
                else:
                    temp_genre = ' '.join(set(temp_genre.split(',')))
            
            gen_dict2[temp_name] = temp_genre
            dfTest3.genre.iloc[index] = gen_dict2[temp_name]

In [26]:
#saving test data
dfTest3.to_csv('test3.tsv',sep='\t',index=False)
dfTest3.to_csv('test3_2.tsv',sep='\t',index=False) #backup

In [59]:
#source: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit


def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default 3-fold cross-validation,
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [3]:
#source: Feature Union with Heterogeneous Data Sources: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html 
from sklearn.base import BaseEstimator, TransformerMixin
class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to scikit-learn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

In [88]:
from sklearn.base import BaseEstimator, TransformerMixin

class XYZTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, examples):
        # return self and nothing else 
        return self
    
    def transform(self, examples):
        
        import nltk
        import sklearn
        X = []
        for ii, x in enumerate(examples):
            temp = nltk.pos_tag(x)
            app = []
            for j in temp:
                app.append(j[1])
            X.append(' '.join(app))
            
        vect = sklearn.feature_extraction.text.TfidfVectorizer(analyzer='word')
        gh = vect.fit_transform(X) 
        return gh

In [89]:
from sklearn.decomposition import TruncatedSVD,PCA
from sklearn.feature_selection import SelectKBest, chi2
class FeatEngr:
    def __init__(self):
        from sklearn.pipeline import Pipeline
        from sklearn.feature_extraction.text import CountVectorizer
        import nltk
        from sklearn.pipeline import FeatureUnion

        allmyfeatures = FeatureUnion(transformer_list=[
            #apply tfidf on sentence
                ("tfidf",Pipeline([
                    ('selector',ItemSelector(key='sentence')),
                    ('tfidf',sklearn.feature_extraction.text.TfidfVectorizer(analyzer = stemmed_words, ngram_range=(1, 1) ,use_idf=True, smooth_idf=True, sublinear_tf=False))
               ])), 
            ("tfidf2",Pipeline([
                    ('selector',ItemSelector(key='genre')),
                    ('tfidfG',sklearn.feature_extraction.text.TfidfVectorizer(analyzer = 'word', ngram_range=(1,3),use_idf=True, smooth_idf=True, sublinear_tf=False)),
                    ('best', TruncatedSVD(n_components=82))
                ])), 
            #applying count vectorizer on genres
            ("cvG", Pipeline([
                ('selector',ItemSelector(key='genre')),
                ('countVect', CountVectorizer(tokenizer=nltk.word_tokenize,analyzer='word'))
            ])),
            ("cvGdd", Pipeline([
                ('selector',ItemSelector(key='sentence')),
                ('countVect', XYZTransformer())
            ])),
            ("cvT", Pipeline([
                ('selector',ItemSelector(key='trope')),
                ('countVect2', CountVectorizer(analyzer='word',ngram_range=(1,1)))
                ])),
            ("cv2", Pipeline([
                ('selector',ItemSelector(key='sentence')),
                ('countVect3', sklearn.feature_extraction.text.HashingVectorizer (ngram_range=(1, 1),analyzer=stemmed_words,norm='l1',alternate_sign=False)),
                ('best', TruncatedSVD(n_components=92))
                ]))
        ])
        
        self.vectorizer = allmyfeatures

    def build_train_features(self, examples):
        """
        Method to take in training text features and do further feature engineering 
        Most of the work in this homework will go here, or in similar functions  
        :param examples: currently just a list of forum posts  
        """
        return self.vectorizer.fit_transform(examples)

    def get_test_features(self, examples):
        """
        Method to take in test text features and transform the same way as train features 
        :param examples: currently just a list of forum posts  
        """
        return self.vectorizer.transform(examples)

    def show_top10(self):
        """
        prints the top 10 features for the positive class and the 
        top 10 features for the negative class. 
        """
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.logreg.coef_[0])[-10:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:10]
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))
        
    def model_predict(self):
        """
        Method to read in test data from file, make predictions
        using trained model, and dump results to file 
        """
        from sklearn.metrics import accuracy_score
        
        # read in test data 
        dfTest  = pd.read_csv('test3.tsv',sep='\t')
        
        # featurize test data 
        #dfTest['sentence1'] = dfTest[['trope', 'sentence','genre']].apply(lambda x: ' '.join(x), axis=1)
        self.X_test = self.get_test_features((dfTest[["sentence","trope","genre","page"]]))
        
        # make predictions on test data 
        pred = self.logreg.predict(self.X_test)
        y_test = np.array(dfTest["spoiler"], dtype=int)
        # dump predictions to file for submission to Kaggle  
        pd.DataFrame({"spoiler": np.array(pred, dtype=bool)}).to_csv("prediction.csv", index=True, index_label="Id")
        print("Predictions generated.")
        acc_train = accuracy_score(y_test,pred)
        print("accuracy on train set: " + str(acc_train))
    
    def train_model_cv(self, random_state=None, train_split=0.8, cv=True):
        """
        Method to read in training data from file, random splitting to cross validation-train set and 
        train Logistic Regression classifier on just the training set. 
        
        :param random_state: seed for random number generator
        """
        from sklearn.linear_model import LogisticRegression 
        from sklearn.metrics import accuracy_score
        from sklearn.metrics import confusion_matrix
        import random
        
        # load data 
        dfTraining = pd.read_csv("train2.tsv",sep='\t')
        
        #create random split into train and test set
        msk = np.random.RandomState(random_state).rand(len(dfTraining)) < train_split
        #dfTraining['sentence1'] = dfTraining[['trope', 'sentence','genre']].apply(lambda x: ' '.join(x), axis=1)
        # get training features and labels 
        X_training = self.build_train_features(dfTraining[["sentence","trope","genre","page"]])
        y_training = np.array(dfTraining["spoiler"], dtype=int)
        
        # train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
        self.logreg = LogisticRegression(random_state=random_state)
        if(cv):
            X_cv = X_training[~msk]
            y_cv = y_training[~msk]
            self.X_train = X_training[msk]
            self.y_train = y_training[msk]
            self.logreg.fit(self.X_train, self.y_train)
            #get valid set features and labels
            pred_cv = self.logreg.predict(X_cv)
            acc_cv = accuracy_score(y_cv,pred_cv)
            print("accuracy on cv set: " + str(acc_cv))
            pred_train = self.logreg.predict(self.X_train)
            acc_train = accuracy_score(self.y_train,pred_train)
            print("accuracy on train set: " + str(acc_train))
        else:
            self.logreg.fit(X_training, y_training)
            print("number of features: "+str(X_training.todense().shape))
            pred_train = self.logreg.predict(X_training)
            acc_train = accuracy_score(y_training,pred_train)
            print("accuracy on train set: " + str(acc_train))
#         plt = plot_learning_curve(self.logreg, "Learning Curves for logreg", X_training, y_training,ylim=(0.0,1.0))
#         plt.show()

In [92]:
# Instantiate the FeatEngr clas 
feat = FeatEngr()

# Train your Logistic Regression classifier 
feat.train_model_cv()

accuracy on cv set: 0.7613305613305613
accuracy on train set: 0.9066387872451647


In [91]:
# to produce kaggle submission file
feat2 = FeatEngr()
feat2.train_model_cv(random_state = 1230, cv=False)
feat2.model_predict()

number of features: (11970, 18099)
accuracy on train set: 0.899749373433584


ValueError: X has 18097 features per sample; expecting 18099

### [25 points] Problem 2: Motivation and Analysis 
***

The job of the written portion of the homework is to convince the grader that:

- Your new features work
- You understand what the new features are doing
- You had a clear methodology for incorporating the new features

Make sure that you have examples and quantitative evidence that your features are working well. Be sure to explain how you used the data (e.g., did you have a validation set? did you do cross-validation?) and how you inspected the results. In addition, it is very important that you show some kind of an **error analysis** throughout your process.  That is, you should demonstrate that you've looked at misclassified examples and put thought into how you can craft new features to improve your model. 

A sure way of getting a low grade is simply listing what you tried and reporting the Kaggle score for each. You are expected to pay more attention to what is going on with the data and take a data-driven approach to feature engineering.

**A**: For solving the problem above, I systematically tried including new features from the given dataset and after a while felt the need of adding additional information from internet about individual data points. Below is a detailed timeline of how I went about achieving the accuracy of predictions I submitted on kaggle. 

### 1. Check if the data is balanced or not.

In [44]:
# load data 
dfTrain = pd.read_csv("train.csv")
y_train = np.array(dfTrain["spoiler"], dtype=int)
spoiler = 0
not_spoiler = 0
for i in range(len(y_train)):
    if(y_train[i]==1):
        spoiler = spoiler+1
    else:
        not_spoiler = not_spoiler + 1
print("number of spoiler examples in the whole set: " + str(spoiler))
print("number of not spoiler examples in the whole set: " + str(not_spoiler))
print("proportion of spoiler to not spoiler examples: " + str(spoiler/(spoiler+not_spoiler)*100) + " : "+str(not_spoiler/(spoiler+not_spoiler)*100))

number of spoiler examples in the whole set: 6288
number of not spoiler examples in the whole set: 5682
proportion of spoiler to not spoiler examples: 52.531328320802004 : 47.468671679197996


The dataset is balanced since we have almost equal examples of both spoiler and not spoiler examples (52:47 ratio).

### 2. Using Bag-of-words on sentences

**What I did?**: For encoding sentences to feed the logistic regression model, I first tried using bag-of-words (BoW) model for sentences. I used sklearn's CountVectorizer for this purpose. I implemented cross-validation by splitting the training data randomly into train and cross-validation (CV) set (80%-20% split, stays the same throughout my experiments) and ran every experiment around 10 times to get an idea of range of accuracy of training and test sets. I also removed stop words (including punctuations) and normalized the features before feeding it to the model.<br>
**Why I did it?**: BoW allows us to count how many times a word appears in a document. Those word counts allow us to compare sentences and gauge their similarities. BoW measures frequencies.<br>
**What went wrong?**: My training set accuracy was always above 98% but CV set was between 50-60% => implying overfitting. In order to handle this overfitting, I went on exploring other features that can be extracted from text.

### 3. Using TF-IDF on sentences

**What I did?**: Instead of using CountVectorizer I used sklearn's TfidfVectorizer.<br>
**Why I did it?**: TF-IDF measures the number of times that words appear in a given document (that’s term frequency). But because words such as “and” or “the” appear frequently in all documents, those are systematically discounted. That’s the inverse-document frequency part. The more documents a word appears in, the less valuable that word is as a signal. That’s intended to leave only the frequent AND distinctive words as markers.<br>
**Intuition**: Capturing relevance of words in a sentence instead of just the frequencies. <br>
**Result?**: My training set accuracy went down to 82% and my CV accuracy went up to 67%. This indicates that overfitting has reduced and my model is now able to generalize better. 

### 4. Using CountVectorizer on Tropes

**What I did?**: In order to give more information to model about the data, I started looking at other columns provided in the dataset like tropes.<br>
**Why I did it?**: There were 2 more columns: Trope, Page. I tried page, but the training accuracy -> 84% and CV accuracy-> 60%. But by using tropes, training set accuracy went up to 91% and CV accuracy went up to 75%. Therefore I chose tropes.<br>
**Intuition**: The model was suffering from high bias since both training and test accuracies were down before. Therefore I started to look for new features.
**Result?**: Training set accuracy 91% and CV accuracy 74%. 

### 5. Using CountVectorizer on Genres from IMDB dataset

**What I did?**: Pre-processed training set to include genres from imdb dataset.<br>
**Intuition**: Genres like Talk Shows, Game Shows, Reality-TV will naturally not have as many spoilers as Thriller, Drama, Crime, Sci-fi. Therefore, such information will help classify the sentence for whether it will have a spoiler or not. <br>
**Result?**: Training set accuracy 90% and CV accuracy 74%. 

### 6. Using TF IDF on Genres

**What I did?**: Capturing only relevant genres instead of all of them.<br>
**Intuition**: I observed that there were multiple rows corresponding to the same primary title (name of the movie/tvshow) maybe because of different seasons of the show / differen version. Hence I consider only the genres that are relevant. <br>
**Result?**: Training set accuracy 91% and CV accuracy 75%. 

### 7. Dimensionality Reduction

**What I did?**: Capturing the features that matter the most using sklearn's truncated SVD.<br>
**Intuition**: I had 1067472 features by now and not all of them mattered as much.<br>
**Result?**: Dimensions reduced to: 18074. Training set accuracy 91% and CV accuracy 77%. 

For this homework, I have used code from couple of websites like scikit-learn's official website, IMDB website etc. The source URL's have been mentioned as comments throughout the notebook. 

### Hints 
***

- Don't use all the data until you're ready. 

- Examine the features that are being used.

- Do error analyses.

- If you have questions that aren’t answered in this list, feel free to ask them on Piazza.

### FAQs 
***

> Can I heavily modify the FeatEngr class? 

Totally.  This was just a starting point.  The only thing you cannot modify is the LogisticRegression classifier.  

> Can I look at TV Tropes?

In order to gain insight about the data yes, however, your feature extraction cannot use any additional data (beyond what I've given you) from the TV Tropes webpage.

> Can I use IMDB, Wikipedia, or a dictionary?

Yes, but you are not required to. So long as your features are fully automated, they can use any dataset other than TV Tropes. Be careful, however, that your dataset does not somehow include TV Tropes (e.g. using all webpages indexed by Google will likely include TV Tropes).

> Can I combine features?

Yes, and you probably should. This will likely be quite effective.

> Can I use Mechanical Turk?

That is not fully automatic, so no. You should be able to run your feature extraction without any human intervention. If you want to collect data from Mechanical Turk to train a classifier that you can then use to generate your features, that is fine. (But that’s way too much work for this assignment.)

> Can I use a Neural Network to automatically generate derived features? 

No. This assignment is about your ability to extract meaningful features from the data using your own experimentation and experience.

> What sort of improvement is “good” or “enough”?

If you have 10-15% improvement over the baseline (on the Public Leaderboard) with your features, that’s more than sufficient. If you fail to get that improvement but have tried reasonable features, that satisfies the requirements of assignment. However, the extra credit for “winning” the class competition depends on the performance of other students.

> Where do I start?  

It might be a good idea to look at the in-class notebook associated with the Feature Engineering lecture where we did similar experiments. 


> Can I use late days on this assignment? 

You can use late days for the write-up submission, but the Kaggle competition closes at **4:59pm on Friday February 23rd**

> Why does it say that the competition ends at 11:59pm when the assignment says 4:59pm? 

The end time/date are in UTC.  11:59pm UTC is equivalent to 4:59pm MST.  Kaggle In-Class does not allow us to change this. 