# Feature Selection

This notebook compares features extracted from Twitter threads against one another. For a look into the Twitter data itself, see the [Exploratory Data Analysis](./exploratory_data_analysis.ipynb).

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from lib.util import fetch_thread

In [2]:
Z, y = fetch_thread("germanwings-crash")
Z.describe()

| | favorite_count_mean | favorite_count_sum | favorite_count_var | user_mentions_mean | user_mentions_sum | user_mentions_var | media_count_mean | media_count_sum | media_count_var | sensitive_mean | ... | created | src.created_at | src.tweets_total | first_resp | last_resp | resp_var | component_count | largest_cc_diameter | time_to_first_resp | time_to_last_resp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | ... | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 | 405.0 |
| mean | 4.798631e-16 | -2.535695e-17 | -4.79726e-18 | 2.456197e-16 | -5.26328e-17 | 1.748944e-16 | -7.319248e-16 | 8.2787e-17 | -5.756712e-16 | 8.539123e-17 | ... | 1.167455e-12 | 1.790474e-15 | -9.375217e-17 | 4.124152e-13 | -3.996101e-13 | 9.848089e-17 | 3.528042e-16 | -3.772017e-16 | -1.699601e-17 | -1.112964e-16 |
| std | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| min | -0.500453 | -0.4164179 | -0.160863 | -2.013617 | -0.7834236 | -0.8595646 | -0.7377092 | -0.6853851 | -0.8290427 | -0.1954637 | ... | -0.6262043 | -1.351978 | -0.9986559 | -0.639483 | -0.7303974 | -0.2341164 | -1.682398 | -1.09645 | -0.1161921 | -0.4846707 |
| 25% | -0.3813547 | -0.3644712 | -0.1596798 | -0.6164995 | -0.642219 | -0.4743147 | -0.7377092 | -0.6853851 | -0.8290427 | -0.1954637 | ... | -0.5972268 | -0.6506988 | -0.7932471 | -0.6040732 | -0.6726064 | -0.2340749 | -0.5136652 | -0.3960364 | -0.1129378 | -0.4557384 |
| 50% | -0.2629071 | -0.2822223 | -0.1555286 | -0.2394995 | -0.3127418 | -0.2875268 | -0.429293 | -0.182521 | -0.4795957 | -0.1954637 | ... | -0.5691532 | -0.2159354 | -0.2830801 | -0.5656328 | -0.5751977 | -0.2326118 | -0.5136652 | -0.3960364 | -0.1090147 | -0.3903084 |
| 75% | -0.06506063 | -0.1047378 | -0.1384019 | 0.4257945 | 0.3462125 | 0.0393519 | 0.4398801 | 0.320343 | 0.7434684 | -0.1954637 | ... | 0.8221633 | 0.4621539 | 0.664022 | 0.7782767 | 0.5762397 | -0.2099105 | 0.6550674 | 0.3043773 | -0.09220795 | -0.04945117 |
| max | 12.65699 | 10.6439 | 12.72865 | 7.559225 | 7.924188 | 10.31268 | 3.580118 | 5.851848 | 2.84015 | 10.7925 | ... | 2.537289 | 2.571983 | 6.646507 | 3.898232 | 4.613243 | 9.88673 | 6.49873 | 5.907687 | 18.98264 | 8.028642 |


## Univariate Feature Selection

### Pearson Correlation

For each potentially useful feature $X$, calculate the Pearson correlation coefficient between the values of that feature and the response variable $Y$, which is 0 if the thread is a non-rumor and 1 if it is a rumor.

$$\rho = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$$

* $\operatorname{cov}(X,Y)$ is the covariance of $X$ and $Y$
* $\sigma_X$ is the standard deviation of feature $X$
* $\sigma_Y$ is the standard deviation of response variable $Y$
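
As a quick check of the formula, the sketch below computes $\rho$ directly from the covariance and standard deviations and compares it with `np.corrcoef`. The feature and label here are synthetic and invented for illustration (they are not taken from the thread data), and `pearson_rho` is a hypothetical helper name.

```python
import numpy as np

def pearson_rho(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance and standard deviations (ddof=0 throughout);
    # the normalization cancels as long as it is used consistently.
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return cov_xy / (x.std() * y.std())

# Synthetic data (an assumption for this example): a feature that tends
# to be larger when the label is 1.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)          # 0 = non-rumor, 1 = rumor
x_demo = 0.8 * y_demo + rng.normal(size=200)   # noisy feature shifted by the label

rho = pearson_rho(x_demo, y_demo)
print(round(rho, 3), round(np.corrcoef(x_demo, y_demo)[0, 1], 3))
```

The two printed values should agree, since `np.corrcoef` evaluates the same ratio.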

In [None]:
from scipy.stats import pearsonr

# Correlate each feature column with the rumor label.
# pearsonr returns a (coefficient, p-value) pair per feature.
corr = (
    Z.apply(lambda x: pearsonr(x, y), axis=0, result_type='expand')
    .T
    .rename(columns={0: "coef", 1: "pval"})
    .sort_values(by=["coef"], ascending=False)
)

plt.figure(figsize=(15, 15))
plt.title("Correlation Between Features and Rumor Label (Germanwings Crash)")

# Label each bar with the feature name and its p-value.
labels = [f"{index} (p = {row.pval:.2f})" for index, row in corr.iterrows()]
ax = sns.barplot(y=labels, x=corr.coef, palette="Set2")
ax.set(xlabel="Pearson Correlation Coefficient", ylabel="Feature")

# Annotate each bar with its coefficient, just past the end of the bar.
offset = 0.01
for i, (index, row) in enumerate(corr.iterrows()):
    ax.text(row.coef + (offset if row.coef > 0 else -offset), i, round(row.coef, 2),
            color="black",
            ha="center",
            fontsize='small')

The p-value for each feature is, roughly, the probability that an uncorrelated system would produce a correlation of at least this magnitude by chance. By using Pearson correlation, we're assuming that there's a linear relationship between a particular feature $X$ and the classification response variable $Y$.
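
To make that interpretation concrete, the following sketch estimates a p-value by permutation: shuffling the labels simulates an uncorrelated system, and the p-value is the fraction of shuffles whose correlation is at least as large in magnitude as the observed one. This is an illustration only. It is not how `pearsonr` computes its p-value (that uses an analytic distribution), and the data and the `permutation_pvalue` helper are hypothetical.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks any real association between x and y.
        if abs(np.corrcoef(x, rng.permutation(y))[0, 1]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

# Synthetic data (an assumption for this example).
rng = np.random.default_rng(1)
y_demo = rng.integers(0, 2, size=150)        # 0 = non-rumor, 1 = rumor
related = y_demo + rng.normal(size=150)      # depends on the label
unrelated = rng.normal(size=150)             # independent of the label

p_related = permutation_pvalue(related, y_demo)
p_unrelated = permutation_pvalue(unrelated, y_demo)
print(p_related, p_unrelated)
```

The feature that depends on the label should get a very small p-value, while the independent feature can land anywhere between 0 and 1.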

The results of this test seem to agree with intuition about which features are relevant to rumor classification.
* Whether the original poster is verified, `src.user_verified`, correlates negatively with the thread being labeled a rumor.
* Surprisingly, the time of the first response, `first_resp`, has the strongest positive correlation of all the features for this dataset.

## Regression

When all the features are on the same scale, the most important features should have the largest coefficients (in absolute value) in the model, and irrelevant features should have coefficients close to zero. TK
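
A minimal sketch of this idea, assuming an ordinary least-squares fit on synthetic standardized features (the notebook does not yet specify which regression model it will use, so both the model choice and the data below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: only the first two influence the label.
X = rng.normal(size=(n, 4))
signal = 2.0 * X[:, 0] - 1.0 * X[:, 1]
y_demo = (signal + rng.normal(size=n) > 0).astype(float)   # 0 = non-rumor, 1 = rumor

# Standardize so that coefficients are comparable across features.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Ordinary least squares with an intercept column (a linear probability model).
A = np.column_stack([np.ones(n), X_std])
coef, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
weights = coef[1:]                      # drop the intercept

# Rank features by absolute coefficient, largest first.
ranking = np.argsort(-np.abs(weights))
for idx in ranking:
    print(f"feature_{idx}: {weights[idx]:+.3f}")
```

In this toy setup the two informative features come out on top, with opposite signs, and the two noise features receive coefficients near zero.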

## Collinearity

In [None]:
# Heatmap of pairwise Pearson correlations between all features.
fig, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(Z.corr(), annot=True, linewidths=0.5, fmt='.1f', ax=ax)
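
With this many features the heatmap can be hard to read, so it may help to list the strongly correlated pairs directly. The sketch below is one possible way to do that. The `correlated_pairs` helper, the 0.9 threshold, and the demo column names are all hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the self-correlations on the diagonal are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Masked-out (NaN) entries never pass the comparison, so they are filtered here.
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data (an assumption for this example).
rng = np.random.default_rng(0)
base = rng.normal(size=300)
demo = pd.DataFrame({
    "count_sum": base,
    "count_mean": base + 0.05 * rng.normal(size=300),   # near-duplicate of count_sum
    "independent": rng.normal(size=300),
})

pairs = correlated_pairs(demo)
print(pairs)
```

On the real data the call would be `correlated_pairs(Z)`, and each reported pair is a candidate for dropping one of its two features.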