# Selecting words to use as features
In this notebook, we'll be using TFIDF to narrow down the tens of thousands of unique words in our articles and their titles into a more condensed list of important words. 

### I. [Initial vectorization](#Initial-vectorization)
### II. [Stemming](#Stemming)
### III. [TFIDF vectorization](#TFIDF-vectorization)

## Reading in libraries and data

In [163]:
import pandas as pd
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Git would not allow us to push a CSV containing the entire dataset so we had to split it up by X and y as well as by train and test sets. Below, we read in all of those files and put them back into one dataframe.

In [164]:
X_train = pd.read_csv('../datasets/X_train_w_SA.csv')
X_test = pd.read_csv('../datasets/X_test_w_SA.csv')
y_train = pd.read_csv('../datasets/y_train.csv')
y_test = pd.read_csv('../datasets/y_test.csv')

y_train['train_dataset'] = 1
y_test['train_dataset'] = 0

X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])

X.reset_index(drop=True, inplace=True)
y.reset_index(drop=True, inplace=True)

df = pd.concat([X, y], axis = 1)

Now, we'll vectorize the combined title and text to get a sense of the frequency with which each word is used in these articles.

## Initial vectorization

In [167]:
# concatenating titles and text
df['all_text'] = [np.nan]*df.shape[0]

for i in range(df.shape[0]):
    text = df.loc[i, 'text']
    title = df.loc[i, 'title']
    all_text = title + ' ' + text
    df.loc[i,'all_text'] = all_text

In [None]:
# vectorizing into a large dataframe

cvec = CountVectorizer(stop_words='english')
cvec.fit(df['text'])
vec_df = pd.DataFrame(cvec.transform(df['text']).todense(),
                      columns = cvec.get_feature_names())

print(vec_df.shape)
vec_df.head()

In [None]:
# utilizing this tool to narrow to actual dictionary words
# http://openbookproject.net/courses/python4fun/spellcheck.html

words = open("../datasets/spellcheck.txt").readlines()
words = [word.strip() for word in words]

real_words = []

for i, col in enumerate(vec_df):
    if col in words:
        real_words.append(col)
    if i % 20000 == 0:
        print(i)
        
        
vec_df = vec_df[real_words]

In [166]:
vec_df.head()

0,aal,aardvark,aba,aback,abacus,abandon,abandoned,abandoning,abandonment,abandons,...,zonation,zone,zoned,zones,zoning,zoo,zoom,zorro,zu,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39853,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39854,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39855,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
39856,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Stemming

In order to find the most important words, we concatenated the titles with their corresponding text and created a new version of that text where only the stems remained. We then vectorized again to find the frequency and importance of use for each stem.

In [168]:
# instantiating stemmer and creating a list of real words
stemmer = PorterStemmer()
word_list = vec_df.columns

In [169]:
# creating a column for the stemmed versions of the text
df['stemmed'] = [np.nan]*df.shape[0]

# concatenating stems from text into strings in the stemmed column
for i, t in enumerate(list(df['all_text'])):
    stemmed = []
    for word in t.split(' '):
        if word.lower() in word_list:
            stem = stemmer.stem(word)
            stemmed.append(stem)
        else:
            pass
            
    df.loc[i,'stemmed'] = ' '.join(stemmed)
    if (i % 3000) == 0:
        print(i)

0
3000
6000
9000
12000
15000
18000
21000
24000
27000
30000
33000
36000
39000


Here, we vectorize the text again so we have a dataframe of only the stems.

In [170]:
cv = CountVectorizer(stop_words='english')
cv.fit(df['stemmed'])

words_cv = cv.transform(df['stemmed'])

stems_df = pd.DataFrame(words_cv.todense(), columns=cv.get_feature_names())

## TFIDF vectorization

In order to determine the importance of words in one "document" relative to another (in this case, one document is fake news and the other is real news), we concatenated all of the stemmed text into two long strings, one for each class and compared those to one another. The result was a dataframe with the stems and their relative importance to each "document"

In [171]:
r_string = ''
f_string = ''

for i in range(df.shape[0]):
    string = df.loc[i,'stemmed']
    if df.loc[i,'is_true'] == 1:
        r_string += ' '+string
    else: 
        f_string += ' '+string

tvec = TfidfVectorizer(stop_words='english')

tvec.fit([r_string,f_string])

tv = pd.DataFrame(tvec.transform([r_string, f_string]).todense(),
                   columns=tvec.get_feature_names(),
                   index=['real', 'fake'])

tv.head()

We then utilized this dataframe to capture the most important words from each document as our features. We set a threshold of 0.01 in order to narrow the words down and got a list of 577 words.

In [175]:
tv_t = tv.T

r_words = set(tv_t[tv_t['fake'] > 0.01].index)
f_words = set(tv_t[tv_t['fake'] > 0.01].index)
selected_words = list(r_words.union(f_words))
selected_words.sort()
len(selected_words)a

577

Using those selected words, we narrowed dataframe and added those 577 columns into the same dataframe as our engineered features relating to punctuation, sentiment, or parts of speech. In total, we had a dataframe with our approximately 40,000 samples and 626 features.

In [178]:
full_df = df.merge(stems_df[selected_words], right_index = True, left_index = True)
full_df.shape

In [None]:
# creating a list of the punctuation, sentiment, and parts of speech columns
feats = list(full_df.columns[3:52])
feats

In [204]:
# adding all of our selected words for a full list of our features
feats.extend(selected_words)
len(feats)

626

In [206]:
# exporting new X_train and X_test CSVs with our chosen features
full_df.loc[full_df['train_dataset'] == 1, feats].to_csv('../datasets/X_train_w_SA_and_words.csv', index = False)
full_df.loc[full_df['train_dataset'] == 0, feats].to_csv('../datasets/X_test_w_SA_and_words.csv', index = False)

Now we're going to explore some other features we can add to our model before doing some EDA and actual modeling.