# [Paris Saclay Center for Data Science](http://www.datascience-paris-saclay.fr)

## [Fake news RAMP](http://www.ramp.studio/problems/fake_news): classify statements of public figures

_Emanuela Boros (LIMSI/CNRS), Balázs Kégl (LAL/CNRS), Roman Yurchak (Symerio)_

## Introduction
This is an introductory project to present RAMP and show you how it works.

The goal is to develop prediction models able to **identify which news is fake**. 

The data we will manipulate is from http://www.politifact.com. The input consists of short statements by public figures (and sometimes anonymous bloggers), plus some metadata. The output is a truth level, judged by journalists at Politifact. They use six truth levels, which we coded into integers to obtain an [ordinal regression](https://en.wikipedia.org/wiki/Ordinal_regression) problem:
```
0: 'Pants on Fire!'
1: 'False'
2: 'Mostly False'
3: 'Half-True'
4: 'Mostly True'
5: 'True'
```
Your goal is to classify each statement (+ metadata) into one of the categories.
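
As a minimal sketch, the coding above can be written as a small lookup table. The names `TRUTH_LEVELS`, `LABEL_TO_CODE` and `encode_truth` are hypothetical and are not part of the starting kit.

```python
# Hypothetical helper: map Politifact labels to the ordinal codes listed above.
TRUTH_LEVELS = [
    'Pants on Fire!',
    'False',
    'Mostly False',
    'Half-True',
    'Mostly True',
    'True',
]
LABEL_TO_CODE = {label: code for code, label in enumerate(TRUTH_LEVELS)}


def encode_truth(label):
    """Return the integer truth level (0 to 5) for a Politifact label."""
    return LABEL_TO_CODE[label]


print(encode_truth('Half-True'))  # 3
```

Because the codes are ordered, predicting 4 when the truth is 5 is a smaller mistake than predicting 0, which is what makes the problem ordinal.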

### Requirements

* numpy>=1.10.0  
* matplotlib>=1.5.0 
* pandas>=0.19.0  
* scikit-learn>=0.17 (different syntaxes for v0.17 and v0.18)   
* seaborn>=0.7.1
* nltk

Further, an nltk dataset needs to be downloaded:

```
python -m nltk.downloader popular
```

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Exploratory data analysis

### Loading the data

In [None]:
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename, sep='\t')
data['date'] = pd.to_datetime(data['date'])

In [None]:
data.info()

In [None]:
data = data.fillna('')

In [None]:
data.head()

In [None]:
data.describe()

The original training data frame has 13000+ instances. In the starting kit, we give you a subset of 7569 instances for training and 2891 instances for testing.

Most columns are categorical, and some have high cardinality.

In [None]:
print(np.unique(data['state']))
print(len(np.unique(data['state'])))
data.groupby('state').count()[['job']].sort_values(
    'job', ascending=False).reset_index().rename(
    columns={'job': 'count'}).plot.bar(
    x='state', y='count', figsize=(16, 10), fontsize=18);

In [None]:
print(np.unique(data['job']))
print(len(np.unique(data['job'])))
data.groupby('job').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().plot.bar(
        x='job', y='count', figsize=(16, 10), fontsize=18);

If you want to use the journalist and the editor as input, you will need to split the lists, since an instance sometimes has more than one of them.
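
One possible way to do the split is sketched below on a toy frame. The comma separator and the example names are assumptions made for illustration; check the actual format of the `edited_by` and `researched_by` columns before reusing it. The helper `split_names` is hypothetical.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Toy frame standing in for the real data; the comma separator is an assumption.
toy = pd.DataFrame({'edited_by': ['Alice Smith, Bob Jones', 'Alice Smith', '']})


def split_names(cell, sep=','):
    """Split a cell into a list of stripped, non-empty names."""
    return [name.strip() for name in cell.split(sep) if name.strip()]


toy['editors'] = toy['edited_by'].apply(split_names)

# Count how often each individual editor appears across all instances.
editor_counts = Counter(chain.from_iterable(toy['editors']))
print(editor_counts.most_common())
```

The per-person counts can then be thresholded or one-hot encoded like any other categorical column.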

In [None]:
print(np.unique(data['edited_by']))
print(len(np.unique(data['edited_by'])))
data.groupby('edited_by').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().plot.bar(
        x='edited_by', y='count', figsize=(16, 10), fontsize=10);

In [None]:
print(np.unique(data['researched_by']))
print(len(np.unique(data['researched_by'])))

In [None]:
data.groupby('researched_by').count()[['state']].sort_values(
    'state', ascending=False).reset_index().rename(
    columns={'state': 'count'}).plot.bar(
        x='researched_by', y='count', figsize=(16, 10), fontsize=6);

There are 2000+ different sources.

In [None]:
print(np.unique(data['source']))
print(len(np.unique(data['source'])))
data.groupby('source').count()[['state']].rename(
    columns={'state': 'count'}).sort_values(
    'count', ascending=False).reset_index().loc[:100].plot.bar(
        x='source', y='count', figsize=(16, 10), fontsize=10);

### Predicting truth level

The goal is to predict the truthfulness of statements. Let us group the data according to the `truth` column:

In [None]:
data.groupby('truth').count()[['source']].reset_index().plot.bar(x='truth', y='source');

## The pipeline

To submit at the [RAMP site](http://ramp.studio), you will have to write two classes, saved in two different files:   
* the class `FeatureExtractor`, which will be used to extract features for classification from the dataset and produce a numpy array of size (number of samples $\times$ number of features). 
* the class `Classifier`, which will be used to predict the truth level from these features.

### Feature extraction overview

Before going through the code, we first need to understand how **tf-idf** works. The **Term Frequency** is a count of how many times a word occurs in a given document (the bag-of-words representation). The **Inverse Document Frequency** decreases with the number of documents in the corpus that contain the word; a common definition is the logarithm of the total number of documents divided by the number of documents containing the word. **tf-idf**, the product of the two, weights words according to how important they are: words that appear in many documents receive a lower weight, while words that appear in few documents receive a higher weight.
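
To make the weighting concrete, here is a minimal sketch that computes tf-idf by hand on a toy corpus, using the textbook formula tf × log(N / df), where N is the number of documents and df the number of documents containing the word. The corpus and the helper `tfidf` are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    'taxes went up',
    'taxes went down',
    'crime went down',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each word.
df = Counter(word for tokens in tokenized for word in set(tokens))


def tfidf(tokens):
    """Textbook tf-idf weights for one tokenized document."""
    tf = Counter(tokens)
    return {word: count * math.log(n_docs / df[word])
            for word, count in tf.items()}


weights = tfidf(tokenized[0])
print(weights)
```

In the first document, *went* occurs in every document and gets weight zero, while *up* occurs in only one document and gets the largest weight.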


The ``FeatureExtractor`` class is used to extract features
from text documents. It is based on the [`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) class from scikit-learn which is a [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) followed by [`TfidfTransformer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer).

See the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) for a general introduction to text feature extraction.

`CountVectorizer` converts a collection of text documents to a matrix of token (*word*) counts. This implementation produces a sparse representation of the counts to be passed to the `TfidfTransformer`.
The `TfidfTransformer` transforms a count matrix to a normalized tf or tf-idf representation.

A `TfidfVectorizer` does these two steps. 
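
The equivalence can be checked on a toy corpus; this is a small sketch with default parameters, not part of the starting kit.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Toy corpus, made up for illustration.
docs = ['taxes went up', 'taxes went down', 'crime went down']

# Two steps: token counts, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_steps = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_steps.toarray(), one_step.toarray()))
```

Both results are sparse matrices with one row per document and one column per vocabulary word.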

The feature extractor overrides `fit` and provides the `TfidfVectorizer` with a custom preprocessing step, presented below.

### Improving  feature extraction

#### Preprocessing 

The document preprocessing can be customized in the `document_preprocessor` function.

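For instance, to transform accented Unicode characters into their unaccented counterparts (e.g. è -> e), the `document_preprocessor` function defined in the second cell below can be used. The first cell below computes the most frequent subjects, jobs, sources and states, using the count thresholds set at its top: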

In [None]:
import ast
import itertools
from collections import Counter

subjects_threshold = 250
job_threshold = 100
source_threshold = 50
state_threshold = 100

subjects = [ast.literal_eval(data['subjects'][i]) for i in range(data.shape[0])]
subjects = dict(Counter(itertools.chain.from_iterable(subjects)))
subjects = {k:v for k,v in zip(subjects.keys(), subjects.values()) if v > subjects_threshold}
subjects['Other subject'] = 1
subjects = sorted(set(subjects))
print(subjects, "\n")

job = dict(Counter(list(data['job'].values)))
del job['']
del job['None']
job = {k:v for k,v in zip(job.keys(), job.values()) if v > job_threshold}
job['Other job'] = 1
job = sorted(set(job))
print(job, "\n")

source = dict(Counter(list(data['source'].values)))
source = {k:v for k,v in zip(source.keys(), source.values()) if v > source_threshold}
source['Other source'] = 1
source = sorted(set(source))
print(source, "\n")

state = dict(Counter(list(data['state'].values)))
del state['']
state = {k:v for k,v in zip(state.keys(), state.values()) if v > state_threshold}
state['Other state'] = 1
state = sorted(set(state))
print(state)

In [None]:
import unicodedata

def document_preprocessor(doc):
    # TODO: is there a way to avoid these encode/decode calls?
    try:
        doc = unicode(doc, 'utf-8')
    except NameError:  # unicode is a default on python 3
        pass
    doc = unicodedata.normalize('NFD', doc)
    doc = doc.encode('ascii', 'ignore')
    doc = doc.decode("utf-8")
    return str(doc)

See also the `strip_accents` option of [`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
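
If only Python 3 needs to be supported, accent stripping can be sketched without the encode/decode round trip. The function below is a hypothetical alternative, not the starting kit's code: it decomposes each character and drops the combining marks, and unlike the ASCII-encoding approach it keeps non-ASCII characters that have no decomposition.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after Unicode decomposition (Python 3 only)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents('crème brûlée'))  # creme brulee
```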


##### Stopword removal
The most frequent words often do not carry much meaning. Examples: *the, a, of, for, in, ...*. 

Stop word removal can be enabled by passing the `stop_words='english'` parameter at the initialization of the
[`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 

A custom list of stop words (e.g. from NLTK) can also be used.
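
Both options are sketched below on a toy corpus. The parameter is spelled `stop_words`; the short custom list is made up for illustration (a list such as `nltk.corpus.stopwords.words('english')` could be passed in the same way, once the NLTK data is downloaded).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration.
docs = ['the taxes went up in the state', 'crime went down in the city']

# Built-in English stop word list.
builtin = TfidfVectorizer(stop_words='english').fit(docs)

# Custom stop word list passed explicitly.
custom = TfidfVectorizer(stop_words=['the', 'in', 'went']).fit(docs)

print(sorted(builtin.vocabulary_))
print(sorted(custom.vocabulary_))
```

The fitted `vocabulary_` attribute shows which words survive the filtering.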

##### Word / character n-grams

By default, the bag of words model is used in the starting kit. To use word or character n-grams, the `analyzer` and `ngram_range` parameters of [`TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) should be changed.
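
As a small sketch on a toy corpus, the two settings below build word unigrams plus bigrams, and character n-grams restricted to word boundaries. The specific ranges are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration.
docs = ['taxes went up', 'taxes went down']

# Word unigrams and bigrams.
word_ngrams = TfidfVectorizer(analyzer='word', ngram_range=(1, 2)).fit(docs)

# Character 2-grams and 3-grams, built only inside word boundaries.
char_ngrams = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3)).fit(docs)

print(sorted(word_ngrams.vocabulary_))
print(len(char_ngrams.vocabulary_))
```

N-grams enlarge the vocabulary quickly, so it is worth watching the number of features when widening `ngram_range`.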


##### Stemming  and Lemmatization

English words like *look* can be inflected with a morphological suffix to produce *looks, looking, looked*. They share the same stem *look*. Often (but not always) it is beneficial to map all inflected forms into the stem. The most commonly used stemmer is the Porter Stemmer, named after its developer, Martin Porter. Here we use `SnowballStemmer('english')` from *NLTK*; it is called Snowball because Porter created a programming language with this name for writing new stemming algorithms.

Stemming can be enabled with a custom `token_processor` function, e.g.

In [None]:
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('english')

def token_processor(tokens):
    for token in tokens:
        yield stemmer.stem(token)

### Feature extractor

The feature extractor implements a `transform` function. It is saved in the file [`submissions/pawel_guzewicz/feature_extractor.py`](/edit/submissions/pawel_guzewicz/feature_extractor.py). It receives the pandas dataframe `X_df` defined at the beginning of the notebook. It should produce a numpy array representing the extracted features, which will then be used for the classification.  

**Note:** the following code cells are *not* executed in the notebook. The notebook saves their contents in the file specified in the first line of the cell, so you can edit your submission before running the local test below and submitting it at the RAMP site.

In [None]:
%%file submissions/pawel_guzewicz/feature_extractor.py
# -*- coding: utf-8 -*-

from __future__ import unicode_literals
import pandas as pd
import scipy
from scipy.sparse import hstack
from sklearn import preprocessing
from sklearn import decomposition
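# The text module also exposes ENGLISH_STOP_WORDS; the standalone stop_words
# module is not importable in recent scikit-learn releases.
from sklearn.feature_extraction import text as stop_words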
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
import ast

def document_preprocessor(doc):
    """ A custom document preprocessor

    This function can be edited to add some additional
    transformation on the documents prior to tokenization.

    At present, this function passes the document through
    without modification.
    """

    return doc

def token_processor(tokens):
    """ A custom token processor
    
    This function can be edited to add some additional
    transformation on the extracted tokens (e.g. stemming)
    """

    stemmer = SnowballStemmer('english')
    for token in tokens:
        yield stemmer.stem(token)

class FeatureExtractor(TfidfVectorizer):
    """Convert a collection of raw docs to a matrix of TF-IDF features. """

    def __init__(self):
        nltk_stop_words = set(stopwords.words('english'))
        sklearn_stop_words = set(stop_words.ENGLISH_STOP_WORDS)
        another_stop_words = set(['a', 'able', 'about', 'above', 'abroad', 'according', 'accordingly', 'across', 'actually', 'adj', 'after', 'afterwards', 'again', 'against', 'ago', 'ahead', 'ain\'t', 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'alongside', 'already', 'also', 'although', 'always', 'am', 'amid', 'amidst', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', 'aren\'t', 'around', 'as', 'a\'s', 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'back', 'backward', 'backwards', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', 'came', 'can', 'cannot', 'cant', 'can\'t', 'caption', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', 'c\'mon', 'co', 'co.', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'corresponding', 'could', 'couldn\'t', 'course', 'c\'s', 'currently', 'd', 'dare', 'daren\'t', 'definitely', 'described', 'despite', 'did', 'didn\'t', 'different', 'directly', 'do', 'does', 'doesn\'t', 'doing', 'done', 'don\'t', 'down', 'downwards', 'during', 'e', 'each', 'edu', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'entirely', 'especially', 'et', 'etc', 'even', 'ever', 'evermore', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'f', 'fairly', 'far', 'farther', 'few', 'fewer', 'fifth', 'first', 'five', 'followed', 'following', 'follows', 'for', 'forever', 'former', 'formerly', 'forth', 'forward', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'get', 'gets', 'getting', 'given', 'gives', 'go', 'goes', 'going', 'gone', 'got', 
'gotten', 'greetings', 'h', 'had', 'hadn\'t', 'half', 'happens', 'hardly', 'has', 'hasn\'t', 'have', 'haven\'t', 'having', 'he', 'he\'d', 'he\'ll', 'hello', 'help', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'here\'s', 'hereupon', 'hers', 'herself', 'he\'s', 'hi', 'him', 'himself', 'his', 'hither', 'hopefully', 'how', 'howbeit', 'however', 'hundred', 'i', 'i\'d', 'ie', 'if', 'ignored', 'i\'ll', 'i\'m', 'immediate', 'in', 'inasmuch', 'inc', 'inc.', 'indeed', 'indicate', 'indicated', 'indicates', 'inner', 'inside', 'insofar', 'instead', 'into', 'inward', 'is', 'isn\'t', 'it', 'it\'d', 'it\'ll', 'its', 'it\'s', 'itself', 'i\'ve', 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'know', 'known', 'knows', 'l', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', 'let\'s', 'like', 'liked', 'likely', 'likewise', 'little', 'll', 'look', 'looking', 'looks', 'low', 'lower', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'mayn\'t', 'me', 'mean', 'meantime', 'meanwhile', 'merely', 'might', 'mightn\'t', 'mine', 'minus', 'miss', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', 'mustn\'t', 'my', 'myself', 'n', 'name', 'namely', 'nd', 'near', 'nearly', 'necessary', 'need', 'needn\'t', 'needs', 'neither', 'never', 'neverf', 'neverless', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'no-one', 'nor', 'normally', 'not', 'nothing', 'notwithstanding', 'novel', 'now', 'nowhere', 'o', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'on', 'once', 'one', 'ones', 'one\'s', 'only', 'onto', 'opposite', 'or', 'other', 'others', 'otherwise', 'ought', 'oughtn\'t', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'own', 'p', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'possible', 'presumably', 'probably', 'provided', 'provides', 'q', 'que', 'quite', 'qv', 'r', 'rather', 'rd', 're', 'really', 
'reasonably', 'recent', 'recently', 'regarding', 'regardless', 'regards', 'relatively', 'respectively', 'right', 'round', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'second', 'secondly', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'shall', 'shan\'t', 'she', 'she\'d', 'she\'ll', 'she\'s', 'should', 'shouldn\'t', 'since', 'six', 'so', 'some', 'somebody', 'someday', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specified', 'specify', 'specifying', 'still', 'sub', 'such', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'that\'ll', 'thats', 'that\'s', 'that\'ve', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'there\'d', 'therefore', 'therein', 'there\'ll', 'there\'re', 'theres', 'there\'s', 'thereupon', 'there\'ve', 'these', 'they', 'they\'d', 'they\'ll', 'they\'re', 'they\'ve', 'thing', 'things', 'think', 'third', 'thirty', 'this', 'thorough', 'thoroughly', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'till', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 't\'s', 'twice', 'two', 'u', 'un', 'under', 'underneath', 'undoing', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'upwards', 'us', 'use', 'used', 'useful', 'uses', 'using', 'usually', 'uucp', 'v', 'value', 'various', 've', 'versus', 'very', 'via', 'viz', 'vs', 'w', 'want', 'wants', 'was', 'wasn\'t', 'way', 'we', 'we\'d', 'welcome', 'well', 'we\'ll', 'went', 'were', 'we\'re', 'weren\'t', 'we\'ve', 'what', 'whatever', 'what\'ll', 'what\'s', 'what\'ve', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'where\'s', 'whereupon', 'wherever', 'whether', 'which', 'whichever', 'while', 'whilst', 
'whither', 'who', 'who\'d', 'whoever', 'whole', 'who\'ll', 'whom', 'whomever', 'who\'s', 'whose', 'why', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wonder', 'won\'t', 'would', 'wouldn\'t', 'x', 'y', 'yes', 'yet', 'you', 'you\'d', 'you\'ll', 'your', 'you\'re', 'yours', 'yourself', 'yourselves', 'you\'ve', 'z', 'zero'])
        all_stop_words = list(nltk_stop_words | sklearn_stop_words | another_stop_words)
        
        super(FeatureExtractor, self).__init__(preprocessor=document_preprocessor, analyzer='word', lowercase=True, strip_accents='unicode', stop_words=all_stop_words)#, max_df=1, min_df=0.01)

    def fit(self, X_df, y=None):
        """Learn a vocabulary dictionary of all tokens in the raw documents.

        Parameters
        ----------
        X_df : pandas.DataFrame
            a DataFrame, where the text data is stored in the ``statement``
            column.
        """

        super(FeatureExtractor, self).fit(X_df.statement, y)
        return self

    def build_tokenizer(self):
        """
        Internal function, needed to plug-in the token processor, cf.
        http://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes
        """

        tokenize = super(FeatureExtractor, self).build_tokenizer()
        return lambda doc: list(token_processor(tokenize(doc)))

    def transform(self, X_df):
        # Work on a re-indexed copy so that the caller's dataframe is not modified.
        self.df = X_df[['job', 'source', 'state', 'subjects']].reset_index(drop=True)
        
        subjects = ['Candidate Biography', 'Crime', 'Economy', 'Education', 'Elections', 'Energy', 'Environment', 'Federal Budget', 'Health Care', 'Immigration', 'Jobs', 'Message Machine 2010', 'Message Machine 2012', 'Other subject', 'State Budget', 'Taxes'] 
        job = ['Democrat', 'Other job', 'Republican']
        source = ['Barack Obama', 'Chain email', 'Chris Christie', 'Hillary Clinton', 'Joe Biden', 'John Boehner', 'John McCain', 'Marco Rubio', 'Michele Bachmann', 'Mitt Romney', 'Newt Gingrich', 'Other source', 'Rick Perry', 'Rick Scott', 'Sarah Palin', 'Scott Walker'] 
        state = ['Arizona', 'Florida', 'Georgia', 'Illinois', 'Massachusetts', 'New Jersey', 'New York', 'Ohio', 'Oregon', 'Other state', 'Rhode Island', 'Texas', 'Virginia', 'Wisconsin']
        self.df = self.df.join(pd.DataFrame(columns=subjects)).join(pd.DataFrame(columns=job)).join(pd.DataFrame(columns=source)).join(pd.DataFrame(columns=state))
        self.df.fillna(0, inplace=True)
        
        for i in range(self.df.shape[0]):
            subjects_row = ast.literal_eval(self.df['subjects'][i])
            for j in subjects_row:
                if j in subjects:
                    self.df.at[i, j] = 1
                else:
                    self.df.at[i, 'Other subject'] = 1

            job_ = self.df.at[i, 'job']
            if str(job_) in job:
                self.df.at[i, job_] = 1
            else:
                self.df.at[i, 'Other job'] = 1
            
            source_ = self.df.at[i, 'source']
            if source_ in source:
                self.df.at[i, source_] = 1
            else:
                self.df.at[i, 'Other source'] = 1
            
            state_ = self.df.at[i, 'state']
            if state_ in state:
                self.df.at[i, state_] = 1
            else:
                self.df.at[i, 'Other state'] = 1
        
        self.df.drop('subjects', axis=1, inplace=True)
        self.df.drop('job', axis=1, inplace=True)
        self.df.drop('source', axis=1, inplace=True)
        self.df.drop('state', axis=1, inplace=True)

        X = hstack([super(FeatureExtractor, self).transform(X_df.statement), scipy.sparse.csr_matrix(self.df.values)]).toarray()
        #X = preprocessing.MaxAbsScaler().fit_transform(X)
        #X = decomposition.IncrementalPCA(n_components=3).fit_transform(X)
        #X = preprocessing.scale(X)
        return X

    def fit_transform(self, X_df, y=None):
        return self.fit(X_df, y).transform(X_df)

### Classifier

The classifier follows a classical scikit-learn classifier template. It should be saved in the file [`submissions/pawel_guzewicz/classifier.py`](/edit/submissions/pawel_guzewicz/classifier.py). In its simplest form it takes a scikit-learn pipeline, assigns it to `self.clf` in `__init__`, then calls its `fit` and `predict_proba` functions in the corresponding member functions.
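
The simplest form described above could look like the sketch below. The class name `MinimalClassifier` and the choice of `MaxAbsScaler` followed by `LogisticRegression` are assumptions made for illustration, not the starting kit's model, and the random toy data only serves to show the shape of the output.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler


class MinimalClassifier(BaseEstimator):
    """Hypothetical minimal classifier wrapping a scikit-learn pipeline."""

    def __init__(self):
        self.clf = make_pipeline(MaxAbsScaler(), LogisticRegression())

    def fit(self, X, y):
        self.clf.fit(X, y)
        return self

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)


# Toy data: 60 samples, 5 features, all six truth levels present.
rng = np.random.RandomState(0)
X = rng.rand(60, 5)
y = np.arange(60) % 6

proba = MinimalClassifier().fit(X, y).predict_proba(X)
print(proba.shape)  # (60, 6)
```

`predict_proba` returns one column per truth level, which is the same layout the `predict_proba` methods in the submission below produce.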

In [4]:
%%file submissions/pawel_guzewicz/classifier.py
# -*- coding: utf-8 -*-

from math import floor
import numpy as np
from sklearn import preprocessing
from sklearn.base import BaseEstimator
from sklearn.neighbors import KNeighborsRegressor
#from sklearn.linear_model import LogisticRegression
#from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
#from sklearn.multiclass import OneVsRestClassifier
from sklearn.multiclass import OneVsOneClassifier
#from xgboost import XGBClassifier

class OneVsOneClassifierFixed(OneVsOneClassifier):
    def predict_proba(self, X):
        pred_proba = np.zeros([X.shape[0], 6])
        pred = self.predict(X)
        for i in range(pred.shape[0]):
            pred_proba[i][int(pred[i])] = 1
        return pred_proba

class KNeighborsRegressorFixed(KNeighborsRegressor):
    def predict_proba(self, X):
        pred_proba = np.zeros([X.shape[0], 6])
        pred = self.predict(X)
        for i in range(pred.shape[0]):
            pred_proba[i][int(pred[i])] = 1
        return pred_proba

class Classifier(BaseEstimator):
    # Credits for the function vcorrcoef go to:
    # https://waterprogramming.wordpress.com/2014/06/13/numpy-vectorized-correlation-coefficient/
    def __vcorrcoef__(self, X, y):
        Xm = np.ones(X.shape[1]) * np.reshape(np.mean(X, axis=1), (X.shape[0], 1))
        ym = np.mean(y)
        r_num = np.sum((X - Xm) * (y - ym), axis=1)
        r_den = np.sqrt(np.sum((X - Xm) ** 2, axis=1) * np.sum((y - ym) ** 2))
        r = r_num / r_den
        return r

    def __init__(self):
        clf1 = OneVsOneClassifierFixed(MultinomialNB(), n_jobs=-1)
        clf2 = OneVsOneClassifierFixed(RandomForestClassifier(random_state=234, class_weight="balanced"), n_jobs=-1)
        clf3 = KNeighborsRegressorFixed(n_neighbors=4, n_jobs=-1)
        clf4 = OneVsOneClassifierFixed(GradientBoostingClassifier(random_state=725, loss='exponential'), n_jobs=-1)
        clf5 = OneVsOneClassifierFixed(LinearSVC(loss='hinge', multi_class='crammer_singer'), n_jobs=-1)
        # DON'T USE: not installed on the server
        #clf6 = OneVsOneClassifierFixed(XGBClassifier(), n_jobs=-1)
        #150: 0.372, 0.350, 0.369, 0.36, 0.371, 0368
        self.clf = VotingClassifier(estimators=[('mnb', clf1), ('rf', clf2), ('knr', clf3), ('gb', clf4), ('lsvc', clf5)], voting='soft', weights=[1, 0, 0, 0, 1])
        #self.clf = OneVsOneClassifierFixed(LinearSVC(loss='hinge', multi_class='crammer_singer', C=1.3), n_jobs=-1)
        #self.clf = OneVsOneClassifier(RandomForestClassifier(class_weight="balanced"), n_jobs=-1)
        #self.clf = OneVsOneClassifier(MultinomialNB(), n_jobs=-1)
        # 0.396
        #OneVsOneClassifier(MultinomialNB(), n_jobs=-1)
        # 0.392
        #OneVsOneClassifier(RandomForestClassifier(random_state=234, class_weight="balanced"), n_jobs=-1)
        # 0.39
        #KNeighborsRegressor(n_neighbors=4, n_jobs=-1)
        # 0.387
        #OneVsOneClassifier(GradientBoostingClassifier(random_state=725, loss='exponential'), n_jobs=-1)
        
        # 0.356
        #OneVsOneClassifier(SGDClassifier(loss='modified_huber', tol=1e-3), n_jobs=-1)
        # 0.352
        #OneVsOneClassifier(MLPClassifier(activation='relu', solver='lbfgs', random_state=111), n_jobs=-1)

    def fit(self, X, y):
        X = np.hstack([X, X ** 2])
        correlations = self.__vcorrcoef__(X.T, y)
        correlations_with_numbers = zip(correlations, range(len(correlations)))
        correlations_with_numbers = sorted(correlations_with_numbers, key=lambda tup: abs(tup[0]), reverse=True)
        self.features = sorted(map(lambda tup: tup[1], correlations_with_numbers[:20]))
        self.clf.fit(X[:, self.features], y)
        return self

    def predict(self, X):
        X = np.hstack([X, X ** 2])
        # X is a dense numpy array after np.hstack, so no conversion is needed.
        return self.clf.predict(X[:, self.features])

    def predict_proba(self, X):
        X = np.hstack([X, X ** 2])
        pred_proba = np.zeros([X[:, self.features].shape[0], 6])
        try:
            pred_proba = self.clf.predict_proba(X[:, self.features])
        except AttributeError:
            pred = self.clf.predict(X[:, self.features])
            for i in range(pred.shape[0]):
                pred_proba[i][int(pred[i])] = 1
        return pred_proba

Overwriting submissions/pawel_guzewicz/classifier.py


## Local testing (before submission)

It is <b><span style="color:red">important that you test your submission files before submitting them</span></b>. For this we provide a unit test. Note that the test runs on your files in [`submissions/pawel_guzewicz`](/tree/submissions/pawel_guzewicz), not on the classes defined in the cells of this notebook.

First `pip install ramp-workflow` or install it from the [github repo](https://github.com/paris-saclay-cds/ramp-workflow). Make sure that the python files `feature_extractor.py` and `classifier.py` are in the  [`submissions/pawel_guzewicz`](/tree/submissions/pawel_guzewicz) folder, and the data `train.csv` and `test.csv` are in [`data`](/tree/data). Then run

```ramp_test_submission```

If it runs and prints training and test errors on each fold, then you can submit the code.

### Training on the small subset of training set (quick)

In [5]:
runs = 10
scores_sum = 0
for i in range(runs):
    score = !ramp_test_submission --quick-test --submission=pawel_guzewicz 2> /dev/null | grep test | tail -1 | cut -d' ' -f 4
    score = score.s.split(" ")[0][13:18]
    if score[3] == '\\':
        score = score[:3]
    elif score[4] == '\\':
        score = score[:4]
    print(score)
    scores_sum += float(score)
print("\n{0:.3f}".format(scores_sum / runs))

0.326
0.317
0.337
0.314
0.376
0.363
0.262
0.362
0.313
0.342

0.331


### Training on the whole training set

In [None]:
score = !ramp_test_submission --submission=pawel_guzewicz 2> /dev/null | grep test | tail -1 | cut -d' ' -f 4
score = score.s.split(" ")[0][13:18]
if score[3] == '\\':
    score = score[:3]
elif score[4] == '\\':
    score = score[:4]
print(float(score))

## Submitting to [ramp.studio](http://ramp.studio)

Once you have found a good feature extractor and classifier, you can submit them to [ramp.studio](http://www.ramp.studio). First, if it is your first time using RAMP, [sign up](http://www.ramp.studio/sign_up), otherwise [log in](http://www.ramp.studio/login). Then find an open event on the particular problem, for example, the event fake_news ([Saclay Datacamp](http://www.ramp.studio/events/fake_news_saclay_datacamp_17), [DataFest Tbilisi](https://www.ramp.studio/events/fake_news_tbilisi)) for this RAMP. Sign up for the event. Both signups are controlled by RAMP administrators, so there **can be a delay between asking for signup and being able to submit**.

Once your signup request is accepted, you can go to your sandbox ([Saclay Datacamp](http://www.ramp.studio/events/fake_news_saclay_datacamp_17/sandbox), [DataFest Tbilisi](https://www.ramp.studio/events/fake_news_tbilisi/sandbox)) and copy-paste (or upload) [`feature_extractor.py`](/edit/submissions/pawel_guzewicz/feature_extractor.py) and [`classifier.py`](/edit/submissions/pawel_guzewicz/classifier.py) from `submissions/pawel_guzewicz`. Save it, rename it, then submit it. The submission is trained and tested on our backend in the same way as `ramp_test_submission` does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions ([Saclay Datacamp](http://www.ramp.studio/events/fake_news_saclay_datacamp_17/my_submissions), [DataFest Tbilisi](https://www.ramp.studio/events/fake_news_tbilisi/my_submissions)). Once it is trained, you get a mail, and your submission shows up on the public leaderboard ([Saclay Datacamp](http://www.ramp.studio/events/fake_news_saclay_datacamp_17/leaderboard), [DataFest Tbilisi](https://www.ramp.studio/events/fake_news_tbilisi/leaderboard)). 
If there is an error (despite having tested your submission locally with `ramp_test_submission`), it will show up in the "Failed submissions" table in my submissions ([Saclay Datacamp](http://www.ramp.studio/events/fake_news_saclay_datacamp_17/my_submissions), [DataFest Tbilisi](https://www.ramp.studio/events/fake_news_tbilisi/my_submissions)). You can click on the error to see part of the trace.

After submission, do not forget to give credits to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual way to work with RAMP is to explore solutions, add feature transformations, select models, perhaps do some AutoML/hyperopt, etc., _locally_, and check them with `ramp_test_submission`. The script prints mean cross-validation scores:
```
----------------------------
train sacc = 0.77 ± 0.012
train acc = 0.983 ± 0.01
train tfacc = 0.835 ± 0.014
valid sacc = 0.361 ± 0.05
valid acc = 0.144 ± 0.119
valid tfacc = 0.575 ± 0.101
test sacc = 0.355 ± 0.013
test acc = 0.197 ± 0.023
test tfacc = 0.544 ± 0.021
```
The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard ([Saclay Datacamp](http://www.ramp.studio/events/fake_news_saclay_datacamp_17/leaderboard), [DataFest Tbilisi](https://www.ramp.studio/events/fake_news_tbilisi/leaderboard))) is smoothed accuracy, so the line that is relevant in the output of `ramp_test_submission` is `valid sacc = 0.361 ± 0.05`. When the score is good enough, you can submit it at the RAMP.

## More information

You can find more information in the [README](https://github.com/paris-saclay-cds/ramp-workflow/blob/master/README.md) of the [ramp-workflow library](https://github.com/paris-saclay-cds/ramp-workflow).

## Contact

Don't hesitate to [contact us](mailto:admin@ramp.studio?subject=fake%20news%20notebook).