# Predict hoax using scaled features

Goal: Match or surpass the accuracy achieved with unscaled features (the scores below come from `LogisticRegression.score`, which reports mean accuracy, not $R^2$)

### notes

#### ➜ Fewer coefficients present without scaling. Why?

- A typo let a feature slip through that was supposed to be excluded
    - ➜ fixed
    - The feature sets are the same now, but the discrepancy remains.
- Maybe it's because the intercept was scaled to zero?  
    - ➜ Try only scaling the original floats.
    - Score improved to .54/.5, but unscaled remains .6
- Look at (unscaled_coeff / $\sigma$) to understand feature importance
    - ➜ Try repeating the regression with these features only
    - Score improved to .55/.5, and unscaled remains .58
- Checked Lasso (with all features)
    - ➜ same results as Ridge: .54/.5
- Relaxed my upper limit on C
    - ➜ improved to .57/.5 with C=6
- Ran logistic regression on unscaled features with no regularization
    - ➜ improved to .61/.5
    - Far outperformed the paper author's models
- Let's try SFS with scaled features
    - Success!
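
A note on comparing scaled and unscaled coefficients: the sketch below uses synthetic data and made-up coefficients (not the hoax features) to show how a linear predictor on raw features maps onto standardized features. Under standardization $x' = (x - \mu)/\sigma$, the equivalent coefficient is the raw coefficient multiplied by $\sigma$, and the intercept absorbs the mean term. The identity is exact for the linear predictor itself; with regularization, the coefficients actually fitted on each version will generally not map onto each other this cleanly.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for two raw features on very different scales (assumption, not the hoax data).
X = rng.normal(loc=[5.0, 200.0], scale=[1.0, 50.0], size=(100, 2))
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma  # same arithmetic StandardScaler applies (population std)

# Hypothetical coefficients and intercept of a model fitted on the raw features.
coef_raw = np.array([0.8, -0.01])
intercept_raw = 0.3

# Equivalent model expressed on the standardized features.
coef_scaled = coef_raw * sigma
intercept_scaled = intercept_raw + coef_raw @ mu

logits_raw = X @ coef_raw + intercept_raw
logits_scaled = X_scaled @ coef_scaled + intercept_scaled
same_logits = np.allclose(logits_raw, logits_scaled)
print(same_logits)
```

Ranking features by `coef_raw * sigma` therefore gives magnitudes that are comparable across features with different units.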

## exploration controls

In [1]:
random_seed = 0

## get data

### imports

In [2]:
import os, re, patsy
import pandas as pd, numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector, RFE, RFECV
from sklearn.preprocessing import StandardScaler
path = '/home/bhrdwj/git/predwikt/data/raw/wiki_reliability/unzipped/'

In [3]:
fea = (pd.read_csv(path+'hoax_features.csv', usecols=lambda x: x not in ['Unnamed: 0'])
       .rename(columns={'headings_by_level(2)':'headings_by_level_2', 'revision_id.key':'revision_id_key'}))

### train test split

#### Make series of negative revisions and their revision keys, and vice versa

In [4]:
neg_revs = fea[['revision_id', 'revision_id_key', 'has_template']]
neg_revs = neg_revs.loc[neg_revs.has_template==0].set_index('revision_id')['revision_id_key']
pos_revs = fea[['revision_id', 'revision_id_key', 'has_template']]
pos_revs = pos_revs.loc[pos_revs.has_template==1].set_index('revision_id')['revision_id_key']

neg_revs.shape #, pos_revs.shape

(1390,)

#### Test-train split the neg_revs, and form dfte and dftr

In [5]:
neg_revs_tr, neg_revs_te = train_test_split(neg_revs, test_size=.2, random_state=random_seed)
pos_revs_tr = pos_revs[neg_revs_tr.values]
pos_revs_te = pos_revs[neg_revs_te.values]
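
The split above is done on the negative revisions only, and each positive revision is then pulled into the same partition as its paired negative, so a pair never straddles train and test. A minimal sketch of that idea on toy ids follows; the ids and the names `neg_to_pos` and `pos_to_neg` are made up for illustration, and it assumes (as the indexing above suggests) that a negative revision's key is the id of its positive partner.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: index = negative revision id, value = id of its paired positive revision.
neg_to_pos = pd.Series({101: 201, 102: 202, 103: 203, 104: 204, 105: 205})
# Reverse mapping: index = positive revision id, value = id of its paired negative.
pos_to_neg = pd.Series(neg_to_pos.index, index=neg_to_pos.values)

# Split only the negatives, then look up each partner by label.
neg_tr, neg_te = train_test_split(neg_to_pos, test_size=0.2, random_state=0)
pos_tr = pos_to_neg[neg_tr.values]
pos_te = pos_to_neg[neg_te.values]

# No positive revision appears on both sides of the split.
pairs_kept_together = set(pos_tr.index).isdisjoint(pos_te.index)
print(pairs_kept_together)
```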

In [6]:
revs_tr = pd.concat((neg_revs_tr, pos_revs_tr))
revs_te = pd.concat((neg_revs_te, pos_revs_te))

In [7]:
fea_rev = fea.set_index('revision_id')
dftr = fea_rev.loc[revs_tr.index].dropna()
dfte = fea_rev.loc[revs_te.index].dropna()

In [8]:
del neg_revs, pos_revs, neg_revs_tr, pos_revs_tr, neg_revs_te, pos_revs_te, revs_tr, revs_te, fea_rev

In [9]:
dftr[dftr.columns.difference(['page_id','revision_id_key','has_template'])].describe().T.sort_values(by='mean');

### prep

In [10]:
# remove non-features; dummify categoricals
ytr = dftr.has_template
Xtr = dftr[dftr.columns.difference(['page_id','revision_id_key','has_template'])]
Xtr = patsy.dmatrix('~ '+' + '.join(Xtr.columns), data=Xtr, NA_action='drop', return_type='dataframe')

yte = dfte.has_template
Xte = dfte[dfte.columns.difference(['page_id','revision_id_key','has_template'])]
Xte = patsy.dmatrix('~ '+' + '.join(Xte.columns), data=Xte, NA_action='drop', return_type='dataframe')

# make complete list of columns in case the test set doesn't include any of a rare class
Xcols = list(
    set(Xtr.columns.tolist())
    .union(set(Xte.columns.tolist()))
)

for col in Xcols:
    if col not in Xte:
        Xte[col] = 0
    if col not in Xtr:
        Xtr[col] = 0
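
An alternative way to align the two design matrices, sketched on toy frames (the column names and the `_toy` / `_aligned` variables are illustrative, not from the data): `DataFrame.reindex` with `fill_value=0` adds any missing dummy column and also puts both frames in the same column order.

```python
import pandas as pd

# Toy design matrices; the test frame lacks the rare dummy column.
Xtr_toy = pd.DataFrame({'Intercept': [1.0, 1.0], 'q[T.Stub]': [0.0, 1.0], 'links': [3.0, 7.0]})
Xte_toy = pd.DataFrame({'Intercept': [1.0], 'links': [5.0]})

# Union of columns, sorted so both frames share one ordering.
all_cols = sorted(set(Xtr_toy.columns) | set(Xte_toy.columns))
Xtr_aligned = Xtr_toy.reindex(columns=all_cols, fill_value=0)
Xte_aligned = Xte_toy.reindex(columns=all_cols, fill_value=0)

print(list(Xte_aligned.columns))
```

Matching column order matters whenever the frames are later passed to an estimator as plain arrays rather than selected by column name.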

### to scale or not to scale

In [11]:
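scaler = StandardScaler()
def scale_reindex(df):
    # NOTE: fit_transform refits the scaler on whichever frame is passed in, so the test
    # set is standardized with its own mean and std rather than the training set's.
    # The constant Intercept column has zero variance and comes out as all zeros.
    matrix = scaler.fit_transform(df)
    return pd.DataFrame(data=matrix, index=df.index, columns=df.columns)

Xtr, Xte = map(scale_reindex, [Xtr, Xte])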
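
For comparison, the more conventional pattern is to fit the scaler on the training frame only and reuse its statistics for the test frame. The sketch below uses synthetic frames (`train`, `test` and the column names are placeholders); it is not what produced the scores recorded in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic frames; the test distribution is deliberately shifted.
train = pd.DataFrame(rng.normal(10, 2, size=(50, 2)), columns=['a', 'b'])
test = pd.DataFrame(rng.normal(12, 2, size=(10, 2)), columns=['a', 'b'])

# Learn mean and std from the training data only.
scaler = StandardScaler().fit(train)
train_s = pd.DataFrame(scaler.transform(train), index=train.index, columns=train.columns)
test_s = pd.DataFrame(scaler.transform(test), index=test.index, columns=test.columns)

# Training columns are centered; test columns keep their offset relative to train.
print(train_s.mean().round(6).tolist())
print(test_s.mean().tolist())
```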

## SFS Logistic Regression

### feature selection (keep 8 features)

In [12]:
estimator = LogisticRegression(penalty='l2', C=.1, max_iter=10000, fit_intercept=False)
sfs = SequentialFeatureSelector(estimator=estimator, n_features_to_select=8)
sfs.fit(Xtr, ytr)

SequentialFeatureSelector(estimator=LogisticRegression(C=0.1,
                                                       fit_intercept=False,
                                                       max_iter=10000),
                          n_features_to_select=8)

In [13]:
feats1 = Xtr.columns[sfs.get_support()].tolist()
feats1 = [i for i in feats1 if 'page_id' not in i]

In [14]:
len(feats1)

8
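
As a reference for what the selector does, here is a small forward-selection run on synthetic data (the `_toy` names, the dataset, and the parameter values are illustrative). `SequentialFeatureSelector` greedily adds one feature at a time, keeping whichever addition gives the best cross-validated score, until `n_features_to_select` is reached.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic classification problem standing in for the hoax features.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction='forward',  # the default; 'backward' removes features instead
    cv=5,
)
selector.fit(X_toy, y_toy)

# Column indices of the features that were kept.
selected = selector.get_support(indices=True)
print(selected)
```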

### initialize and fit

In [15]:
lr = LogisticRegression(penalty='l2', C=.1, max_iter=10000, fit_intercept=True)
lr.fit(Xtr[feats1], ytr)

LogisticRegression(C=0.1, max_iter=10000)

#### check results

In [16]:
print(f'Training accuracy {lr.score(Xtr[feats1], ytr)}')
print(f'Testing accuracy {lr.score(Xte[feats1], yte)}')
# baseline = accuracy of always predicting the majority class
print(f'Baseline accuracy {max(yte.mean(), 1 - yte.mean())}')

Training accuracy 0.6033318325078794
Testing accuracy 0.6151079136690647
Baseline accuracy 0.5
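
The baseline here is the accuracy of always predicting the majority class. A minimal sketch on toy labels shows two equivalent ways to compute it (`y_toy` and `dummy_X` are made up; the feature column is a placeholder because `DummyClassifier` ignores it):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy binary labels: three positives, two negatives.
y_toy = pd.Series([1, 1, 1, 0, 0])
dummy_X = [[0]] * len(y_toy)  # features are ignored by the dummy model

# Majority-class baseline via scikit-learn.
baseline = DummyClassifier(strategy='most_frequent').fit(dummy_X, y_toy)
baseline_acc = baseline.score(dummy_X, y_toy)

# Same quantity computed directly from the label proportions.
manual_acc = max(y_toy.mean(), 1 - y_toy.mean())
print(baseline_acc, manual_acc)
```

With balanced classes, as in this test set, both reduce to 0.5.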


#### review fitted coefficients

In [17]:
coeffs = pd.Series(np.ravel(lr.coef_), index=feats1, name='coeffs')

In [18]:
coeffs.sort_values(ascending=False, key=abs)

external_links                   -0.132272
words_to_watch_matches            0.119113
paragraphs_without_refs           0.113074
shortened_footnote_templates     -0.085422
revision_templates                0.077559
headings_by_level_2              -0.060653
article_quality_score[T.Start]   -0.017786
article_quality_score[T.Stub]     0.003022
Name: coeffs, dtype: float64