# Feature Engineering Homework 
***
**Name**: $<$Cory Mosiman$>$ 

**Kaggle Username**: $<$CoryMosiman$>$
***

This assignment is due on Moodle by **5pm on Friday February 23rd**. Additionally, you must make at least one submission to the **Kaggle** competition before it closes at **4:59pm on Friday February 23rd**. Submit only this Jupyter notebook to Moodle. Do not compress it using tar, rar, zip, etc. Your solutions to analysis questions should be written in Markdown directly below the associated question.  Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**.  For a refresher on the course **Collaboration Policy** click [here](https://github.com/chrisketelsen/CSCI5622-Machine-Learning/blob/master/resources/syllabus.md#collaboration-policy).



## Overview 
***

When people are discussing popular media, there’s a concept of spoilers. That is, critical information about the plot of a TV show, book, or movie that “ruins” the experience for people who haven’t read / seen it yet.

The goal of this assignment is to do text classification on forum posts from the website [tvtropes.org](http://tvtropes.org/), to predict whether a post is a spoiler or not. We'll be using the logistic regression classifier provided by sklearn.

Unlike previous assignments, the code provided with this assignment has all of the functionality required. Your job is to make the functionality better by improving the features the code uses for text classification.

**NOTE**: Because the goal of this assignment is feature engineering, not classification algorithms, you may not change the underlying algorithm or its parameters.

This assignment is structured in a way that approximates how classification works in the real world: Features are typically underspecified (or not specified at all). You, the data digger, have to articulate the features you need. You then compete against others to provide useful predictions.

It may seem straightforward, but do not start this at the last minute. There are often many things that go wrong in testing out features, and you'll want to make sure your features work well once you've found them.


## Kaggle In-Class Competition 
***

In addition to turning in this notebook on Moodle, you'll also need to submit your predictions on Kaggle, an online tournament site for machine learning competitions. The competition page can be found here:  

[https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018](https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018)

Additionally, a private invite link for the competition has been posted to Piazza. 

The starter code below has a `model_predict` method which produces a two column CSV file that is correctly formatted for Kaggle (predictions.csv). It should have the example Id as the first column and the prediction (`True` or `False`) as the second column. If you change this format your submissions will be scored as zero accuracy on Kaggle. 

**Note**: You may only submit **THREE** predictions to Kaggle per day.  Instead of using the public leaderboard as your sole evaluation process, it is highly recommended that you perform local evaluation using a validation set or cross-validation. 

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline 

### [25 points] Problem 1: Feature Engineering 
***

The `FeatEngr` class is where the magic happens.  In its current form it will read in the training data and vectorize it using simple Bag-of-Words.  It then trains a model and makes predictions.  

25 points of your grade will be generated from your performance on the classification competition on Kaggle. The performance will be evaluated on accuracy on the held-out test set. Half of the test set is used to evaluate accuracy on the public leaderboard.  The other half of the test set is used to evaluate accuracy on the private leaderboard (which you will not be able to see until the close of the competition). 

You should be able to significantly improve on the baseline system (i.e. the predictions made by the starter code we've provided) as reported by the Kaggle system.  Additionally, the top **THREE** students from the **PRIVATE** leaderboard at the end of the contest will receive 5 extra credit points towards their Problem 1 score.


In [11]:
class FeatEngr:
    def __init__(self):
        
        from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
        
#         TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'
        self.vectorizer = CountVectorizer(stop_words = 'english', 
#                                           token_pattern=TOKENS_ALPHANUMERIC,
                                          ngram_range = (1,4))
        self.tfidf = TfidfTransformer()
        self.tfidf2 = TfidfTransformer()
        self.alpha_num = AlphaNumTransformer()

    def build_train_features(self, examples, subset = False):
        """
        Method to take in training text features and do further feature engineering 
        Most of the work in this homework will go here, or in similar functions  
        :param examples: currently just a list of forum posts  
        """

#         subset = 20
        
        omdb_df = self.omdb_features(examples, subset = subset)
#         print('Examples shape before: {} omdb_df shape before: {}'.format(examples.shape, omdb_df.shape))
        omdb_df.to_csv('omdb_df.csv')
#         print('omdb_df index: {}, columns {}'.format(omdb_df.index, omdb_df.columns))
#         print('examples index: {}, columns {}'.format(examples.index, examples.columns))
        examples = examples.join(omdb_df, on = 'page')
        examples.to_csv('line_49_examples.csv')
#         print('Examples shape after: {} '.format(examples.shape))
        examples = examples.drop(['unique_sep'], axis = 1)
#         print(examples.info())
        return examples

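    def month_to_month(self, a):
        # Map a month name to its number; unrecognized values are returned unchanged.
        # 'Jun' and 'Jul' are added alongside the original 'June'/'July' keys so that
        # three-letter abbreviations (as in '26 Nov 1989') are also matched.
        months_map = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr':4, 'May': 5, 'Jun': 6, 'June': 6,
                 'Jul': 7, 'July': 7, 'Aug':8, 'Sep': 9, 'Oct':10, 'Nov':11, 'Dec': 12,
                 'None': np.nan}
        try:
            return months_map[a]
        except (KeyError, TypeError):
            return a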

    def separate_titles(self, column):
        '''
        Used to separate the title at capital letters for easier passing into get request.
        '''
        import re
        sep_column = re.findall('[A-Z]{1}[a-z]{1,}[^A-Z]*', column)
        if len(sep_column) > 1:
            return ' '.join(sep_column)
        else:
            return column

    def omdb_features(self, examples, subset):
        '''
        This method is meant to find additional features from the Open Movie Database (OMDB).
        An API key was obtained and used to get potentially insightful additional features by using
        the movie title.
        '''

        unique_titles = pd.DataFrame({'unique_unsep':pd.unique(examples['page'])})
        unique_titles['unique_sep'] = unique_titles['unique_unsep'].apply(self.separate_titles)
        unique_titles.set_index('unique_unsep', inplace = True)

        print('There are {} unique titles'.format(unique_titles.shape[0]))

        # Define what we would like to get back from OMDB
        self.new_feat_keys = ['Runtime', 'Released','imdbVotes', 'imdbRating',
                         'Genre', 'Rated', 'Type']
#         self.new_feat_keys = ['Runtime', 'imdbVotes','imdbRating','Genre', 'Type']

        # use subset if testing API
        if subset:
            print('Taking a subset of {} from the total number of unique_titles'.format(subset))
            unique_titles = unique_titles.sample(subset)

        # instantiate new dataframe
        new_feat_df = pd.DataFrame(columns = self.new_feat_keys)

        # make API call, 
        sample2 = unique_titles['unique_sep'].apply(self.call_omdb)
#         print(type(sample2), sample2.head())

        ###
        # call returns 
        new_feat_df = pd.concat(sample2.tolist(), ignore_index=True)

        # create 3 new columns from released date for Day, Month, Year
        if 'Released' in self.new_feat_keys:
            new = pd.DataFrame(new_feat_df['Released'].str.split(' ').values.tolist()).astype('object')
            new.columns = ['num_day','num_month','num_year']
            new['num_month'] = new['num_month'].apply(self.month_to_month)

            # replace 'None' with NaN
            new.fillna(value=np.nan, inplace = True)
            new_feat_df = pd.concat([new_feat_df, new], axis = 1)
            
        new_feat_df.index = unique_titles.index
        new_feat_df.to_csv('line_119_new_feat_df.csv')
        
        print('shape of sample: {}\shape of new_feat_df: {}'.format(unique_titles.shape, new_feat_df.shape))
        final = pd.concat([unique_titles, new_feat_df], axis = 1)

        # convert desired columns to numeric data type
        numeric_cols = ['imdbVotes', 'Runtime','imdbRating','num_day','num_month','num_year']
        self.numeric_cols_current = list(set(self.new_feat_keys).intersection(numeric_cols))
        final[self.numeric_cols_current] = final[self.numeric_cols_current].apply(lambda x: pd.to_numeric(x.astype(str).str.replace(',',''), 
                                                                                errors='coerce'))
        
        final.to_csv('line_130_final_df.csv')
        # combine all text data into one 'text' column, dropping others
        text_cols = ['Genre', 'Rated','Type', 'Released']
        self.text_cols_current = list(set(self.new_feat_keys).intersection(text_cols))
        print('self.text_cols_current type: {}'.format(type(self.text_cols_current)))
        if 'Genre' in self.text_cols_current:
            print('Genre in self.text_cols_current')
            final['Genre'] = final['Genre'].str.replace(', ', ' ')
            final.to_csv('line_137_final.csv')
       
        final['omdb_features_text'] = final[self.text_cols_current].apply(lambda x: ' '.join(x), axis = 1)
        final.to_csv('line_140_final.csv')
        print('final type: {}'.format(type(final)))
#         final = final.drop(self.text_cols_current, axis = 1)

#         from datetime import datetime
        
#         save_file = datetime.now().strftime('%Y-%m-%d %H_%M') + 'api_response.csv'
#         final.to_csv(save_file)
        return final


    def call_omdb(self, title):
        '''
        This method actually performs the call to the OMDB API, checks if the response is valid 
        and has content, and returns the potentially useful features defined by ```self.new_feat_keys```.
        '''
        import requests
        import os
        import json

        omdb_api_key = os.environ['OMDB_API_KEY']
        omdb_base_url = 'http://www.omdbapi.com/'
        parameters = {'apikey': omdb_api_key,
                     't': title,
                     'r': 'json'}
        response = requests.get(omdb_base_url, params = parameters)

        # placeholder row returned whenever no usable data comes back
        return_this = ['NaN']*len(self.new_feat_keys)

        if response.status_code != 200:
            return pd.DataFrame([return_this], columns=self.new_feat_keys)

        resp_dict = json.loads(response.text)
        if resp_dict.get('Response') != 'True':
            return pd.DataFrame([return_this], columns=self.new_feat_keys)

        # keep only the number of minutes, e.g. '30 min' -> '30'
        try:
            resp_dict['Runtime'] = resp_dict['Runtime'].split(' ')[0]
        except (KeyError, AttributeError):
            pass

        # fall back to 'NaN' for any requested key missing from the response
        new_feat_values = [resp_dict.get(feature, 'NaN') for feature in self.new_feat_keys]
        return pd.DataFrame([new_feat_values], columns=self.new_feat_keys)
    
    def get_test_features(self, examples):
        """
        Method to take in test text features and transform the same way as train features 
        :param examples: currently just a list of forum posts  
        """
        # `subset` is not defined in this method; use the full set of titles
        omdb_df = self.omdb_features(examples, subset = False)
        examples = examples.join(omdb_df, on = 'page')

        return self.vectorizer.transform(examples)

    def show_top10(self):
        """
        prints the top 10 features for the positive class and the 
        top 10 features for the negative class. 
        """
        feature_names = np.asarray(self.vectorizer.get_feature_names())
        top10 = np.argsort(self.logreg.coef_[0])[-10:]
        bottom10 = np.argsort(self.logreg.coef_[0])[:10]
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))
    
    
    def train_model(self, random_state=1234):
        """
        Method to read in training data from file, and 
        train Logistic Regression classifier. 
        
        :param random_state: seed for random number generator 
        """
        
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.feature_selection import SelectKBest
        from sklearn.pipeline import Pipeline
        from sklearn.pipeline import FeatureUnion
        from sklearn.preprocessing import FunctionTransformer
        from sklearn.preprocessing import MaxAbsScaler
        from sklearn.preprocessing import StandardScaler
        from sklearn.preprocessing import Normalizer 
        from sklearn.preprocessing import QuantileTransformer
        from sklearn.preprocessing import Imputer
        
        # load data 
        dfTrain = pd.read_csv("../data/spoilers/train.csv")
        self.X_train = self.build_train_features(dfTrain.loc[:,['sentence', 'trope', 'page']])
        
        ####################################################################
        # train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
#         self.logreg = LogisticRegression(random_state=random_state)
#         self.logreg.fit(self.X_train, self.y_train)
        ####################################################################
        
        self.y_train = np.array(dfTrain["spoiler"], dtype=int)
        
        # define simple lambda functions to extract str and float objects separately
        get_text = FunctionTransformer(lambda x: self.make_one_text(x[['sentence','omdb_features_text']]), 
                                       validate = False)
        get_nums = FunctionTransformer(lambda x: x[self.numeric_cols_current], validate = False)
        get_counts = FunctionTransformer(lambda x: self.get_counts(x), validate = False)
        get_text2 = FunctionTransformer(lambda x: self.make_one_text(x['sentence']), 
                                       validate = False)
        
        # Extract only text values, perform vectorizer and tfidf
        text_pipe1 = Pipeline([
            ('get_text', get_text),
            ('vectorizer', self.vectorizer),
            ('tfidf', self.tfidf)
        ])
        
        text_pipe2 = Pipeline([
            ('get_text',get_text2),
            ('alpha_nums', self.alpha_num),
#             ('tfidf2', self.tfidf2)
        ])
        
        # 
        count_pipe = Pipeline([
            ('count_pipe', get_counts),
            ('impute', Imputer()),
            ('scale', StandardScaler)
        ])
        # Extract only numerical values, perform pipelined operations
#         num_pipe = Pipeline([
#             ('get_nums', get_nums),
#             ('impute', Imputer()),
#             ('scale', MaxAbsScaler)
#         ])
        
        # create pipeline for model, perform union on text and numerical pipelines returns,
        # and pass to log reg
        self.logreg = Pipeline([
            ('union', FeatureUnion(
                transformer_list = [
                    ('count', count_pipe),
                    ('text1', text_pipe1),
#                     ('text2', text_pipe2)
#                     ('nums', num_pipe)
                ]
            )),
            ('scale', Normalizer()),
            ('log_reg', LogisticRegression(random_state=random_state))
        ])
        
        print('Right before self.logreg.fit: x_train_shape {}\
              y_train_shape {}'.format(self.X_train.shape, self.y_train.shape))
              
        self.X_train.to_csv('line_271_self_x_train.csv')
        self.logreg.fit(self.X_train, self.y_train)
        
        print('Train set accuracy: {}'.format(self.logreg.score(self.X_train, self.y_train)))
        scores = cross_val_score(self.logreg, self.X_train, self.y_train, cv = 5)
        print(scores)
        print("Mean Accuracy in Cross-Validation = {:.3f}".format(scores.mean()))
    
    def make_one_text(self, df):
        print('make_one_text_function')
        df.to_csv('line_279_make_one_text.csv')
        df['make_one_text'] = df.apply(lambda x: ' '.join(x), axis = 1)
        df = pd.DataFrame(df['make_one_text'], columns = ['make_one_text'])
        df.to_csv('line_283_make_one_text.csv')
        return list(df['make_one_text'])
        
    def get_counts(self, df):

        import re
        from scipy.sparse import csr_matrix
        
        # instantiate empty dataframe
        cols = ['double_spaces','non_alpha','total_length']
        counts_df = pd.DataFrame(columns = cols)
        
        # Count the number of double spaces
        counts_df['double_spaces'] = df['sentence'].apply(lambda x: len(re.findall(r'  ', x)))
        
        # count non alpha numeric characters
        counts_df['non_alpha'] = \
            df['sentence'].apply(lambda x: len([found for found in \
                                                re.findall(r'[^a-zA-Z0-9]', x) if not found == ' ']))
            
        # string length
        counts_df['total_length'] = df['sentence'].str.len()
        
        # number of genres present from APIcall
        counts_df['num_genres'] = \
            df['Genre'].apply(lambda x: len([found for found in \
                                             re.findall(r' ', x)]) + 1)
            
        counts_df['Runtime'] = df['Runtime']
#         counts_df['imdbVotes'] = df['imdbVotes']
#         counts_df['imdbRating'] = df['imdbRating']
#         counts_df.to_csv('line_298_get_counts.csv')
        print('counts_df.shape: {} counts_df.values.shape: {}'.format(counts_df.shape, 
                                                                      counts_df.values.shape))
#         print(list(counts_df.values))
        return counts_df
#         return list(counts_df.values)
#         return counts_df['Runtime']  NOPE
#         return counts_df.values.reshape(-1,1)  NOPE
#         return counts_df.reshape(-1,1)  NOPE
        
    def model_predict(self):
        """
        Method to read in test data from file, make predictions
        using trained model, and dump results to file 
        """
        
        # read in test data 
        dfTest  = pd.read_csv("../data/spoilers/test.csv")
        
        # featurize test data 
        self.X_test = self.build_train_features(dfTest[['sentence', 'trope', 'page']])
        
        # make predictions on test data 
        pred = self.logreg.predict(self.X_test)
        
        # dump predictions to file for submission to Kaggle  
        pd.DataFrame({"spoiler": np.array(pred, dtype=bool)}).to_csv("prediction.csv", 
                                                                     index=True, index_label="Id")
        
from sklearn.base import BaseEstimator, TransformerMixin

class AlphaNumTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, examples, y=None):
        # return self and nothing else 
        return self
    
    def transform(self, examples):
        
        import numpy as np 
        from scipy.sparse import csr_matrix
        
        alpha_nums = ['.', '!', '@', '#', '$','%','^','&','*','-',' ', '  ', '?']
         
        # Initialize matrix 
        X = np.zeros((len(examples), len(alpha_nums)))
        
        # Loop over examples and count each character 
        for ii, x in enumerate(examples):
            X[ii,:] = np.array([x.count(alpha_num) for alpha_num in alpha_nums])
            
        return csr_matrix(X) 
    

In [12]:
# Instantiate the FeatEngr class
feat = FeatEngr()

# Train your Logistic Regression classifier 
feat.train_model(random_state=1230)

# Shows the top 10 features for each class 
# feat.show_top10()

# Make prediction on test data and produce Kaggle submission file 
# feat.model_predict()

There are 679 unique titles
shape of sample: (679, 1)\shape of new_feat_df: (679, 10)
self.text_cols_current type: <class 'list'>
Genre in self.text_cols_current
final type: <class 'pandas.core.frame.DataFrame'>
Right before self.logreg.fit: x_train_shape (11970, 14)              y_train_shape (11970,)
counts_df.shape: (11970, 5) counts_df.values.shape: (11970, 5)


AttributeError: 'numpy.ndarray' object has no attribute 'fit'

### [25 points] Problem 2: Motivation and Analysis 
***

The job of the written portion of the homework is to convince the grader that:

- Your new features work
- You understand what the new features are doing
- You had a clear methodology for incorporating the new features

Make sure that you have examples and quantitative evidence that your features are working well. Be sure to explain how you used the data (e.g., did you have a validation set? did you do cross-validation?) and how you inspected the results. In addition, it is very important that you show some kind of an **error analysis** throughout your process.  That is, you should demonstrate that you've looked at misclassified examples and put thought into how you can craft new features to improve your model. 

A sure way of getting a low grade is simply listing what you tried and reporting the Kaggle score for each. You are expected to pay more attention to what is going on with the data and take a data-driven approach to feature engineering.

### Approach
***

#### OMDB API

After looking at the original dataset and understanding the problem at hand, I began researching spoilers.  I came across the paper _Spoiler Alert: Machine Learning Approaches to Detect Social Media Posts with Revelatory Information_, which discusses the role of additional metadata (Genre, Length, First Aired, Episodes, Country) for creating a more robust model.  They were able to boost their spoiler detection accuracy from 60%, using only unigrams and bigrams, to 67% by adding the aforementioned features to their original baseline model.  With that in mind, I began looking for a source of movie metadata.  I found the [OMDB API](http://www.omdbapi.com/), which provides metadata for movies, shows, etc.  A sample response from their API is below:  
```json
{
    "Title": "America's Funniest Home Videos",
    "Year": "1989–",
    "Rated": "TV-PG",
    "Released": "26 Nov 1989",
    "Runtime": "30 min",
    "Genre": "Comedy, Family, Reality-TV",
    "Director": "N/A",
    "Writer": "N/A",
    "Actors": "Jess Harnell, Tom Bergeron, Bob Saget, Ernie Anderson",
    "Plot": "Viewers from around America send in home videos with comedic moments.",
    "Language": "English",
    "Country": "USA",
    "Awards": "4 wins & 6 nominations.",
    "Poster": "https://images-na.ssl-images-amazon.com/images/M/MV5BMTY3MDkzMDE4Nl5BMl5BanBnXkFtZTcwMjM3ODQzMQ@@._V1_SX300.jpg",
    "Ratings": [
        {
            "Source": "Internet Movie Database",
            "Value": "6.2/10"
        }
    ],
    "Metascore": "N/A",
    "imdbRating": "6.2",
    "imdbVotes": "4,187",
    "imdbID": "tt0098740",
    "Type": "series",
    "totalSeasons": "27",
    "Response": "True"
}
```

Using the suggestions from the _Spoiler Alert_ paper, as well as what I thought might be useful, I decided to extract the following features:

```python
new_features = ['Runtime', 'Released','imdbVotes', 'imdbRating','Genre', 'Rated', 'Type']
```

> **Runtime** (numeric): Potentially shows (28-30 mins) are more/less susceptible to spoilers compared to films (70-120 mins).  
> **Released** (numeric): The researchers found that newer movies were more likely to have spoilers than older movies.  
> **imdbVotes** (numeric): Potentially movies with more/fewer votes are spoiled more often.  
> **imdbRating** (numeric): Same as above.  
> **Genre** (text): Could the genre of the movie affect the spoiler rate?  
> **Rated** (text): Potentially R-rated movies are spoiled more often than G-rated movies.  
> **Type** (text): Maybe movies are more often spoiled than shows.

I wrote a few functions to ingest, sort, and process this data from the API.  One of the shortcomings of the method is that, depending on how a title was formatted, the lookup returned no data for some movies/shows.  In those cases, all of the fields were labeled with `NaN`.
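
As a rough illustration of that processing, the sketch below converts the string fields of the sample response above into numeric values using only the standard library. The helper name `parse_omdb_fields` is hypothetical and is not part of the `FeatEngr` class, and treating `'N/A'` as missing is an assumption based on the `"N/A"` entries in the sample response.

```python
# Month abbreviations as they appear in dates such as "26 Nov 1989"
MONTHS = {name: number for number, name in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}

def parse_omdb_fields(resp):
    """Convert selected OMDB string fields to numbers; missing values become None."""
    def clean(key):
        value = resp.get(key)
        return None if value in (None, 'N/A', '') else value

    parsed = {}

    runtime = clean('Runtime')        # e.g. "30 min"
    parsed['Runtime'] = int(runtime.split(' ')[0]) if runtime else None

    votes = clean('imdbVotes')        # e.g. "4,187"
    parsed['imdbVotes'] = int(votes.replace(',', '')) if votes else None

    rating = clean('imdbRating')      # e.g. "6.2"
    parsed['imdbRating'] = float(rating) if rating else None

    released = clean('Released')      # e.g. "26 Nov 1989"
    if released:
        day, month, year = released.split(' ')
        parsed.update(num_day=int(day), num_month=MONTHS.get(month), num_year=int(year))
    else:
        parsed.update(num_day=None, num_month=None, num_year=None)

    return parsed

sample = {'Runtime': '30 min', 'imdbVotes': '4,187',
          'imdbRating': '6.2', 'Released': '26 Nov 1989'}
print(parse_omdb_fields(sample))
```

Returning `None` (rather than the string `'NaN'`) for missing values keeps every column purely numeric, which is what an imputation step expects.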

#### Feature Engineering
***
**`CountVectorizer()`**  

Using the concepts learned from the _Lecture 3: Logistic Regression and Text Models_ notebook, I decided to use a bag-of-words approach with the ```CountVectorizer()``` class, which creates a sparse representation of a term-frequency matrix.  A term-frequency matrix has one row per document and one column per term in the corpus vocabulary, and each entry counts how often that term occurs in that document.  It is similar in spirit to one-hot encoding applied to text, except that the entries are counts rather than 0/1 indicators, and the class offers a higher-level API for n-grams, stop words, and other options.  
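
To make the term-frequency idea concrete, here is a small plain-Python sketch that counts unigrams and bigrams for two toy documents. It only illustrates the idea: the tokenizer, the `bag_of_words` helper, and the toy sentences are assumptions made for this example, and the sketch does not reproduce the tokenization or stop-word handling of ```CountVectorizer()```.

```python
import re
from collections import Counter

def bag_of_words(text, ngram_range=(1, 2)):
    """Count word n-grams in one document (lowercased, tokens of 2+ word characters)."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts

docs = ['the hero dies at the end', 'the hero wins']

# Vocabulary = every n-gram seen in any document, in a fixed (sorted) column order
vocab = sorted(set().union(*(bag_of_words(d) for d in docs)))

# One row per document, one column per vocabulary entry
matrix = [[bag_of_words(d)[term] for term in vocab] for d in docs]

for term, column in zip(vocab, zip(*matrix)):
    print('{:>10}: {}'.format(term, column))
```

Most entries of such a matrix are zero, which is why a sparse representation is used in practice.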

***

**`TfidfTransformer()`**

The ```TfidfTransformer()``` class implements term-frequency, inverse-document-frequency (tf-idf) weighting.  It builds on the bag-of-words counts produced by ```CountVectorizer()``` (the term-frequency part). However, instead of using only the raw count of each word in a document, it scales each count by a factor of the following form:  

$$
\texttt{idf(t)} = \ln ~ \frac{\textrm{total # documents}}{\textrm{1 + # documents with term }t}
 = \ln ~ \frac{\left|~D~\right|}{1 + \left|\{~d: ~ t \in d\}~\right|}
$$

In effect, it down-weights terms that appear in many documents of the corpus and gives more weight to terms that appear in only a few.
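
The formula above can be checked by hand on a toy corpus. The sketch below implements exactly that expression; the `idf` helper and the four toy documents are made up for illustration. Note that scikit-learn's ```TfidfTransformer()``` uses a slightly different, smoothed variant by default, so its numbers will not match these exactly.

```python
import math

def idf(term, documents):
    """idf(t) = ln(|D| / (1 + |{d : t in d}|)), as in the formula above."""
    n_containing = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / (1 + n_containing))

corpus = ['the hero dies', 'the hero wins', 'the villain escapes', 'a quiet ending']

for term in ['the', 'hero', 'villain']:
    print(term, round(idf(term, corpus), 3))
```

The term that occurs in three of the four documents receives the lowest weight, while the term that occurs in only one document receives the highest.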

***

**`AlphaNumTransformer()`**

I created the ```AlphaNumTransformer()``` to build count features for a set of non-alphanumeric characters, which I thought might provide additional information to the model.  This class was modeled on the ```XYZTransformer()``` class from the _Lecture 7: Feature Engineering_ notebook.  For instance, perhaps spoilers are more likely to contain exclamation points.  The characters investigated are below.

```python
alpha_nums = ['.', '!', '@', '#', '$','%','^','&','*','-',' ', '  ', '?']
```
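
As a sketch of what this transformer is meant to compute, the plain-Python function below counts each of those characters in a single comment. The function name `count_symbols` and the example sentence are hypothetical. Note that `str.count` counts non-overlapping matches, and that every double space also contributes two single spaces, so those two features are correlated.

```python
ALPHA_NUMS = ['.', '!', '@', '#', '$', '%', '^', '&', '*', '-', ' ', '  ', '?']

def count_symbols(sentence, symbols=ALPHA_NUMS):
    """Return one count per symbol, in the same order as `symbols`."""
    return [sentence.count(symbol) for symbol in symbols]

row = count_symbols('He dies!  Really?!')

# Show only the symbols that actually occur in the example
print({symbol: n for symbol, n in zip(ALPHA_NUMS, row) if n})
```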

***

**`get_counts()`**

The original intention of the `get_counts` method was to extract the numerical data from the OMDB API response by applying `lambda` functions to different columns of the dataframe, producing the following numerical/count features:  

* Runtime (**unsuccessful**)
* imdbVotes (**unsuccessful**)
* imdbRating (**unsuccessful**)
* comment length (successful)
* non-alphanumeric characters (successful)
* number of genres represented (successful)
* number of double spaces (successful)
    
These were the original features.  I had planned to add more, but I spent too much time troubleshooting why I was unable to get all of the above to work.
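
For a single row, the four features that did work can be sketched as follows. This is a standalone illustration rather than the method used in the class: the name `sentence_counts` is hypothetical, and returning zero genres for a missing genre string is an assumption of this sketch.

```python
import re

def sentence_counts(sentence, genre_text):
    """Compute simple count features for one comment and its space-separated genre string."""
    missing = genre_text is None or genre_text in ('', 'NaN', 'N/A')
    return {
        'double_spaces': len(re.findall(r'  ', sentence)),
        # any character that is neither a letter, a digit, nor a space
        'non_alpha': len(re.findall(r'[^a-zA-Z0-9 ]', sentence)),
        'total_length': len(sentence),
        'num_genres': 0 if missing else len(genre_text.split()),
    }

features = sentence_counts('He dies!  Really?!', 'Comedy Family Reality-TV')
print(features)
```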

### Understanding the Code and Different Objects

The above approach was implemented using both the ```Pipeline()``` and ```FeatureUnion()``` classes from the ```sklearn.pipeline``` module. A discussion of the different scikit-learn objects is provided below.

> [Estimators](http://scikit-learn.org/stable/tutorial/statistical_inference/settings.html#estimators-objects) are the main API implemented by scikit-learn for any object that learns from data.  An estimator always has a `.fit()` method, which takes a dataset (typically a 2-d array).  An estimator can be a classifier, a regressor, a clustering algorithm, or a transformer that extracts or filters useful features from the raw data.

> [BaseEstimator](http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) is the base class for all estimators.  It provides the `get_params` and `set_params` methods.

> [Transformers](http://scikit-learn.org/stable/data_transforms.html) are a type of estimator that have both `.fit()` and `.transform()` methods.  They are used for the following:

> * Preprocess (e.g. normalization, standardization, imputation, etc.)
> * Reduce (e.g. PCA)
> * Expand (e.g. kernel approximation)
> * Generate feature representations (e.g. feature extraction).

> [TransformerMixin](http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) basically just adds the `.fit_transform()` method to any transformer.

> [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html#pipeline-chaining-estimators)( ) provides a way to chain multiple estimators (which include transformers) together.  It exposes the `.fit()` and `.transform()` methods of the underlying steps through a single interface.  Every step except the last must be a transformer; the last step only needs `.fit()`.

> [FeatureUnion](http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces)( ) provides a very simple method for joining features going through separate transformation steps, such as would happen for text data compared to numerical data.  For instance, text data would likely go through a `CountVectorizer()` to generate a feature representation of the different words, and then maybe a `TfidfTransformer()`.  However, numerical data would likely go through a preprocessing step for standardization or normalization of the features.
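
The contract described above can be sketched without scikit-learn at all. The classes below (`Length`, `Scale`, `Chain`, `Union`) are hypothetical stand-ins written for this illustration, not scikit-learn's implementation: a chain fits each step on the output of the previous one, and a union places the columns produced by each branch side by side.

```python
class Length:
    """Stateless transformer: text -> [number of characters]."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [[len(text)] for text in X]

class Scale:
    """Learns the column maximum in fit, divides by it in transform."""
    def fit(self, X, y=None):
        self.max_ = max(row[0] for row in X) or 1
        return self
    def transform(self, X):
        return [[row[0] / self.max_] for row in X]

class Chain:
    """Pipeline-like: each step is fit on the output of the previous step."""
    def __init__(self, steps):
        self.steps = steps
    def fit(self, X, y=None):
        for step in self.steps:
            X = step.fit(X, y).transform(X)
        return self
    def transform(self, X):
        for step in self.steps:
            X = step.transform(X)
        return X

class Union:
    """FeatureUnion-like: concatenates the columns of every branch."""
    def __init__(self, branches):
        self.branches = branches
    def fit(self, X, y=None):
        for branch in self.branches:
            branch.fit(X, y)
        return self
    def transform(self, X):
        parts = [branch.transform(X) for branch in self.branches]
        return [sum(rows, []) for rows in zip(*parts)]

posts = ['short', 'a much longer post']
union = Union([Chain([Length(), Scale()]), Length()])
features = union.fit(posts).transform(posts)
print(features)  # one row per post: [scaled length, raw length]
```

Every object in this sketch is an instance with both `fit` and `transform`, which is the property each branch of a union and each intermediate step of a chain relies on.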

### Issues

I encountered some issues that I was unsuccessful in resolving throughout the course of the assignment.  In particular, and what was most disappointing, I was unable to use any of the numerical data gathered from the OMDB API (as can be observed from all of the commented sections in the ```get_counts()``` method).  The main reason for writing the above section was to try to better understand why I was getting the following errors:

`numpy ndarray object has no attribute fit`   
`fit not found`  
`ValueError: Expected 2D array, got 1D array instead...Reshape your data using array.reshape(-1,1)...`

I still don't fully understand why my `get_counts` method works for some features but not others, although I now realize that every intermediate object in a pipeline needs both a `.fit()` and a `.transform()` method in order to run successfully.
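
One plausible explanation for the first error, offered here as a hypothesis rather than a confirmed diagnosis, is that the `count_pipe` step in `train_model` is written as `('scale', StandardScaler)`, which passes the class itself instead of an instance (`StandardScaler()`). When a method is called on a class rather than on an instance, the data is bound to `self`, and the call to `self.fit(...)` then fails on the data object. The sketch below reproduces that mechanism with hypothetical stand-in classes (`MixinLike`, `ScalerLike`) and a plain list in place of a NumPy array; it does not use scikit-learn.

```python
class MixinLike:
    """Stand-in for a mixin that builds fit_transform from fit and transform."""
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

class ScalerLike(MixinLike):
    """Toy scaler: divides every value by the maximum seen during fit."""
    def fit(self, X, y=None):
        self.max_ = max(X)
        return self
    def transform(self, X):
        return [x / self.max_ for x in X]

data = [1.0, 2.0, 4.0]
labels = [0, 1, 1]

# Correct: called on an instance, so `self` is the scaler.
print(ScalerLike().fit_transform(data, labels))

# Incorrect: called on the class, so `data` is bound to `self`
# and `self.fit` is looked up on the list.
message = None
try:
    ScalerLike.fit_transform(data, labels)
except AttributeError as err:
    message = str(err)
    print(message)
```

The resulting message has the same shape as the one reported above, with `list` in place of `numpy.ndarray`.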

### Error Analysis

Unfortunately, I spent so much time trying to get the numeric features to work that I didn't build any solid error analysis into my approach.  
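
For reference, a minimal sketch of what such an error analysis could look like is given below: it groups misclassified comments into false positives and false negatives so that they can be read by hand. The function name `misclassified` and the toy labels and sentences are hypothetical; in practice the predictions would come from a held-out validation fold.

```python
def misclassified(sentences, y_true, y_pred):
    """Split misclassified examples into false positives and false negatives."""
    errors = {'false_positive': [], 'false_negative': []}
    for sentence, truth, pred in zip(sentences, y_true, y_pred):
        if pred and not truth:
            errors['false_positive'].append(sentence)
        elif truth and not pred:
            errors['false_negative'].append(sentence)
    return errors

sentences = ['He dies in the finale.',
             'The show aired in 1989.',
             'She was the killer all along.']
y_true = [1, 0, 1]   # 1 = spoiler
y_pred = [1, 1, 0]

errors = misclassified(sentences, y_true, y_pred)
for kind, examples in errors.items():
    print(kind, examples)
```

Reading through each group is what suggests new features, for example words or punctuation that recur in the missed spoilers.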

### Hints 
***

- Don't use all the data until you're ready. 

- Examine the features that are being used.

- Do error analyses.

- If you have questions that aren’t answered in this list, feel free to ask them on Piazza.

### FAQs 
***

> Can I heavily modify the FeatEngr class? 

Totally.  This was just a starting point.  The only thing you cannot modify is the LogisticRegression classifier.  

> Can I look at TV Tropes?

Yes, in order to gain insight about the data. However, your feature extraction cannot use any additional data (beyond what I've given you) from the TV Tropes webpage.

> Can I use IMDB, Wikipedia, or a dictionary?

Yes, but you are not required to. So long as your features are fully automated, they can use any dataset other than TV Tropes. Be careful, however, that your dataset does not somehow include TV Tropes (e.g. using all webpages indexed by Google will likely include TV Tropes).

> Can I combine features?

Yes, and you probably should. This will likely be quite effective.

> Can I use Mechanical Turk?

That is not fully automatic, so no. You should be able to run your feature extraction without any human intervention. If you want to collect data from Mechanical Turk to train a classifier that you can then use to generate your features, that is fine. (But that’s way too much work for this assignment.)

> Can I use a Neural Network to automatically generate derived features? 

No. This assignment is about your ability to extract meaningful features from the data using your own experimentation and experience.

> What sort of improvement is “good” or “enough”?

If you have 10-15% improvement over the baseline (on the Public Leaderboard) with your features, that’s more than sufficient. If you fail to get that improvement but have tried reasonable features, that satisfies the requirements of the assignment. However, the extra credit for “winning” the class competition depends on the performance of other students.

> Where do I start?  

It might be a good idea to look at the in-class notebook associated with the Feature Engineering lecture where we did similar experiments. 


> Can I use late days on this assignment? 

You can use late days for the write-up submission, but the Kaggle competition closes at **4:59pm on Friday February 23rd**.

> Why does it say that the competition ends at 11:59pm when the assignment says 4:59pm? 

The end time/date are in UTC.  11:59pm UTC is equivalent to 4:59pm MST.  Kaggle In-Class does not allow us to change this. 