In [11]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Import helper functions for pre-processing the data
from util import cleaning_data, stemming

In [2]:
# The cleaning_data function loads the data, adds labels, and removes punctuation
data = cleaning_data("data/true.csv", "data/fake.csv")


Example of the dataset `data`, which contains both labels (1 is True, 0 is Fake):

In [3]:
data

Unnamed: 0,text,target
0,new us ambassador nikki haley strutted un frid...,0
1,dear mike tomlin james harrison ben roethlisbe...,0
2,syria opposition wants russia states put press...,1
3,south africa revenue service ask parliament in...,1
4,weak recent inflation readings worry suggest f...,1
...,...,...
44893,hurricane irma moving little west category hur...,0
44894,patrick henningsen 21st century wirewatching w...,0
44895,big name security state sovereignty tennessee ...,0
44896,jason kander democrat running united states se...,0


In [4]:
print('distribution of word frequencies before stemming: ')
pd.Series(' '.join(data['text']).split()).value_counts().describe()

distribution of word frequencies before stemming: 


count    121766.000000
mean         87.001774
std         879.965508
min           1.000000
25%           1.000000
50%           2.000000
75%          11.000000
max      133992.000000
dtype: float64

In [5]:
print('top 10 words before stemming are: ')
pd.Series(' '.join(data['text']).split()).value_counts()[:10]

top 10 words before stemming are: 


trump        133992
said         132816
president     55890
would         55165
people        41855
one           37892
state         34486
also          31357
new           30310
clinton       28694
dtype: int64

## From the descriptive statistics above: 

- The dataset contains `121,766` unique words after stop-word removal (stop words are common words such as *you*, *she*, *and*).
- The median word frequency is `2`, meaning that at least half of all distinct words occur no more than twice in the entire dataset.
- The most frequent word is `trump`, which occurs `133,992` times.

With `121,766` candidate features and roughly `45,000` observations, we risk running into the **curse of dimensionality**. We need to reduce the number of features before training the model.
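
The statistics above can be reproduced on a small scale without pandas. The sketch below uses a made-up three-document corpus (an assumption for illustration, not the real `data['text']`) to show how the vocabulary size, the median frequency, and the share of rare words are obtained.

```python
from collections import Counter
from statistics import median

# Made-up mini corpus standing in for data['text'] (illustration only).
docs = [
    "trump said tax plan",
    "senate said vote trump",
    "trump rally crowd",
]

# Count how often each distinct word occurs across all documents.
counts = Counter(word for doc in docs for word in doc.split())
frequencies = sorted(counts.values())

vocab_size = len(counts)              # number of distinct words (candidate features)
median_freq = median(frequencies)     # typical frequency of a word
# Share of distinct words that occur at most twice in the whole corpus.
rare_share = sum(1 for f in frequencies if f <= 2) / vocab_size

print(vocab_size, median_freq, rare_share)
```

A large share of rare words is the reason the vocabulary can be cut back heavily with little loss of information.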

In [7]:
data['text'] = stemming(data['text'])
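
The `stemming` helper lives in `util` and is not shown in this notebook. To illustrate why stemming shrinks the vocabulary, here is a deliberately naive suffix-stripping sketch. The function `naive_stem` and its suffix list are hypothetical and much cruder than a real stemmer, so its output will not match the stems shown in this notebook.

```python
# Hypothetical suffix list, checked in order; a real stemmer uses far richer rules.
SUFFIXES = ("ing", "ed", "es", "e", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = "state states stated stating vote votes voted voting".split()
stems = [naive_stem(token) for token in tokens]

# Eight distinct surface forms collapse into two stems.
print(len(set(tokens)), "->", len(set(stems)))
```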

In [8]:
print('distribution of word frequencies after stemming: ')
pd.Series(' '.join(data['text']).split()).value_counts().describe()

distribution of word frequencies after stemming: 


count     95924.000000
mean        110.440119
std        1126.504906
min           1.000000
25%           1.000000
50%           2.000000
75%           8.000000
max      134244.000000
dtype: float64

In [10]:
print('top 10 stems after stemming are: ')
pd.Series(' '.join(data['text']).split()).value_counts()[:10]

top 10 stems after stemming are: 


trump         134244
said          132816
state          63382
presid         60429
would          55165
peopl          42011
year           41759
republican     39743
one            39104
say            36911
dtype: int64

In [12]:
# Split the data into two parts: training data (70%) and the remaining data (30%)
train_text, val_test_text = train_test_split(data, random_state=1234, test_size=0.3, stratify=data['target'])

# Split the remaining data into validation data (0.4 * 0.3 = 12% of all rows) and test data (0.6 * 0.3 = 18% of all rows)
val_text, test_text = train_test_split(val_test_text, random_state=1234, test_size=0.6, stratify=val_test_text['target'])
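
The overall share of each subset follows directly from the two `test_size` arguments. As a quick check, the arithmetic can be done with exact fractions (rounding to whole rows is ignored here):

```python
from fractions import Fraction

first_test_size = Fraction(3, 10)   # test_size=0.3 in the first split
second_test_size = Fraction(6, 10)  # test_size=0.6 in the second split

train_share = 1 - first_test_size                     # kept for training
val_share = first_test_size * (1 - second_test_size)  # remainder of the held-out part
test_share = first_test_size * second_test_size       # test part of the held-out part

print(float(train_share), float(val_share), float(test_share))
```

If a 70/10/20 split is wanted instead, setting `test_size=2/3` in the second call would produce it.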

Example of the `train_text` data frame:

In [13]:
train_text

Unnamed: 0,text,target
22225,syrian armi iranian back militia back russian ...,1
29542,initi run megyn kelli sunday newsmagazin show ...,0
15914,21st centuri wire say peopl accept certain amo...,0
21228,wealthi turkish gold trader decis hire former ...,1
12538,dutch businessman convict april sell weapon ex...,1
...,...,...
10971,kate steinl wrong race kill someon obama advoc...,0
2866,mani articl written georg soro collectivist ac...,0
40305,china export oil product north korea novemb ch...,1
33563,donald trump move step closer offici sanction ...,0


For our baseline model, we use `TfidfVectorizer` to convert the articles into `TF-IDF` features and then fit a logistic regression classifier.

- **fit_transform()** learns the vocabulary and `IDF` weights from the training data and returns the training document-term matrix of `TF-IDF` values. The same learned vocabulary and weights are later reused for the validation and test data.

- **transform()** applies the vocabulary and `IDF` weights already learned by **fit_transform()** to new documents and returns their document-term matrix of `TF-IDF` values.
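
To make the difference between fitting and transforming concrete, here is a simplified pure-Python sketch. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn documents as its default. The helpers `fit_idf` and `transform` and the example documents are hypothetical, and tokenisation is reduced to splitting on whitespace.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in train_docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(docs, idf):
    """Build L2-normalised TF-IDF rows using IDF weights that were already learned."""
    vocab = sorted(idf)
    rows = []
    for doc in docs:
        tf = Counter(doc.split())
        row = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(value * value for value in row))
        rows.append([value / norm if norm else 0.0 for value in row])
    return vocab, rows

train_docs = ["trump said tax", "senat said vote"]
new_docs = ["trump vote recount"]  # "recount" was never seen during fitting

idf = fit_idf(train_docs)
vocab, X_new = transform(new_docs, idf)
print(vocab)
print(X_new[0])
```

The unseen word `recount` contributes nothing to the new row, because the vocabulary is fixed at fit time. This is why the vectorizer is fitted on the training data only.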

In [25]:
# Note: ngram_range=(1, 1), i.e. unigrams only, is the default in TfidfVectorizer when not specified.
text_transformer = TfidfVectorizer(stop_words='english', max_features=1000)

# Fit the vectorizer on the training data only, then transform the validation and test data with it
X_train_text = text_transformer.fit_transform(train_text['text'])
X_val_text = text_transformer.transform(val_text['text'])
X_test_text = text_transformer.transform(test_text['text'])

Below are the first 100 feature names (stemmed tokens) in the vocabulary learned by `TfidfVectorizer`. Stop words have already been filtered out, and the same vocabulary is applied to the training, validation, and test articles:

In [26]:
feature_names = text_transformer.get_feature_names_out()
feature_names[:100]

array(['000', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18',
       '20', '2012', '2013', '2014', '2015', '2016', '2017', '21st',
       '21wire', '24', '25', '30', '50', 'abl', 'abort', 'absolut',
       'abus', 'accept', 'access', 'accord', 'account', 'accus', 'act',
       'action', 'activ', 'activist', 'actual', 'ad', 'addit', 'address',
       'administr', 'admit', 'advanc', 'advis', 'affair', 'affect',
       'african', 'agenc', 'agenda', 'agent', 'ago', 'agre', 'agreement',
       'ahead', 'aid', 'aim', 'air', 'al', 'alleg', 'alli', 'allow',
       'alreadi', 'alway', 'ambassador', 'amend', 'america', 'american',
       'announc', 'anoth', 'answer', 'anti', 'anyon', 'anyth', 'appar',
       'appeal', 'appear', 'appoint', 'approach', 'approv', 'april',
       'arab', 'arabia', 'area', 'argu', 'arm', 'armi', 'arrest', 'arriv',
       'articl', 'ask', 'assault', 'assist', 'associ', 'attack',
       'attempt', 'attend', 'attent', 'attorney', 'august'], dtype=object)

In [27]:
print('The number of observations (articles) in the train data: ', X_train_text.shape[0])
print('The number of features (tokens) in the train data: ', X_train_text.shape[1])

The number of observations (articles) in the train data:  31428
The number of features (tokens) in the train data:  1000


Example of the `TF-IDF` matrix **X_val_text** for the validation dataset:

In [28]:
X_val_text.todense()

matrix([[0.02544547, 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.07992456, 0.        , 0.10354717, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.04468951, 0.05243942, ..., 0.08775583, 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

We use logistic regression as our baseline model:

In [29]:
logit = LogisticRegression(penalty = 'l2', C = 1, solver= 'sag', multi_class = 'multinomial')
logit.fit(X_train_text, train_text['target'])

LogisticRegression(C=1, multi_class='multinomial', solver='sag')
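
For intuition, the textbook binary form of logistic regression turns a `TF-IDF` row into a probability by passing a weighted sum through the sigmoid function. The sketch below uses made-up coefficients and a made-up three-feature row; the exact parameterisation scikit-learn uses internally with `multi_class='multinomial'` may differ.

```python
import math

def predict_proba(x, weights, bias):
    """Probability of the positive class under the textbook binary logistic model."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

weights = [2.0, -1.5, 0.5]  # hypothetical learned coefficients, one per feature
bias = -0.25                # hypothetical intercept
x = [0.6, 0.0, 0.8]         # made-up TF-IDF values for one article

p_true = predict_proba(x, weights, bias)
label = int(p_true >= 0.5)  # 1 is True, 0 is Fake
print(round(p_true, 3), label)
```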

Now that the model is trained, we use it to predict labels (true/fake) for the articles in the training and validation data and calculate the accuracy score on each:

In [31]:
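# Store the results under names that do not shadow the imported accuracy_score function.
# Rebinding accuracy_score to a float makes the next call fail with
# "TypeError: 'numpy.float64' object is not callable".
# If the name was already overwritten in this session, re-run the import cell first.
train_predicted_label = logit.predict(X_train_text)
train_accuracy = accuracy_score(train_text['target'], train_predicted_label)

val_predicted_label = logit.predict(X_val_text)
val_accuracy = accuracy_score(val_text['target'], val_predicted_label)

print('the accuracy score on the training data is: ', train_accuracy)
print('the accuracy score on the validation data is: ', val_accuracy)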
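
Accuracy is simply the share of predictions that match the true labels. A minimal pure-Python sketch with made-up labels (the `accuracy` helper is hypothetical and only mirrors the basic definition):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

y_true = [1, 0, 1, 1, 0]  # made-up true labels
y_pred = [1, 0, 0, 1, 0]  # made-up predictions, one of them wrong

example_accuracy = accuracy(y_true, y_pred)
print(example_accuracy)
```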

**Future steps:**

- Continue cleaning the data with regular expressions and other packages (digits, punctuation, 'Router', '21st century')

- Further analysis of data

- Re-run model after data is cleaned

- Discover options to improve the model

Notes from meeting with Cole:

Try different models with different numbers of features.

1. spaCy for stemming/punctuation - `done`
2. remove article sources - `done`
3. try different numbers of features - `pending`
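
One possible way to approach the pending item is to sweep `max_features` and compare cross-validated accuracy. The sketch below is only an outline: the toy corpus and labels are made up, the candidate values `(5, 10, 20)` are arbitrary, and in this notebook the inputs would be `train_text['text']` and `train_text['target']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy corpus (illustration only): 1 is True, 0 is Fake.
texts = [
    "senat vote tax bill", "court rule immigr order",
    "minist meet trade talk", "bank rais interest rate",
    "shock video goe viral", "you wont believ this",
    "watch host destroy guest", "secret plot expos video",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1234)
scores = {}
for max_features in (5, 10, 20):
    # The vectorizer sits inside the pipeline so it is re-fitted on each training fold.
    pipeline = make_pipeline(
        TfidfVectorizer(max_features=max_features),
        LogisticRegression(max_iter=1000),
    )
    fold_scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
    scores[max_features] = fold_scores.mean()

for max_features, score in scores.items():
    print(max_features, round(score, 3))
```

Keeping the vectorizer inside the pipeline means each fold learns its vocabulary from its own training portion, so no information from the held-out fold leaks into the features.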

parsed list of words

analysis: size of trainings 
distribution of words 