# Sports vs. Politics

Make sure the `.txt` files are in a folder called `data/`, located next to this notebook.

* https://github.com/Parassharmaa/Tweet-Classifier/tree/master/twc/data

* http://www.markhneedham.com/blog/2015/02/15/pythonscikit-learn-calculating-tfidf-on-how-i-met-your-mother-transcripts/

* http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

* http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [134]:
import os

data_folder = "data"

labels_to_int = {"politics.txt": 0, "sports.txt": 1}
int_to_labels = {v: k for k, v in labels_to_int.items()}

`corpus` is a list that contains one string per tweet (one line of an input file); `labels` holds the matching integer label at the same index.

In [103]:
corpus, labels = [], []
for input_text_file in os.listdir(data_folder):
    
    # We chose only to train a TF-IDF on top of politics and sports
    # tweets, so far.
    if input_text_file not in ("politics.txt", "sports.txt"):
        continue

    with open(os.path.join(data_folder, input_text_file)) as file_desc:
        for line in file_desc:
            # Add the tweet to the corpus list, and the corresponding label
            # (a numeric value) to the labels list, at the same index.
            corpus.append(line.strip().lower())
            labels.append(labels_to_int[input_text_file])

print("Number of tweets loaded as the corpus:", len(corpus))
print("Number of labels loaded:", len(labels))
print("The following are the first and last ten tweets in the corpus, along with their label:\n")

# First ten tweets, numbered from 1.
for line_id, tw in enumerate(corpus[:10]):
    print(line_id+1, int_to_labels[labels[line_id]], "--", tw)
print()

# Last ten tweets, numbered by their position in the full corpus.
for tw_id, tw in enumerate(corpus[-10:]):
    line_id = tw_id+len(corpus)-10
    print(line_id+1, int_to_labels[labels[line_id]], "--", tw)
print()

Number of tweets loaded as the corpus: 692
Number of labels loaded: 692
The following are the first and last ten tweets in the corpus, along with their label:

1 politics.txt -- megathread: republican health care plan passes house vote
2 politics.txt -- president trump praised australia's universal health care right after the house repealed obamacare
3 politics.txt -- gop congressman: i didn't fully read health care bill
4 politics.txt -- trump's fitness to serve is 'officially part of the discussion in congress'
5 politics.txt -- in rare unity, hospitals, doctors and insurers criticize health bill
6 politics.txt -- senate gop rejects house obamacare bill
7 politics.txt -- federal judge orders georgia to re-open voter registration ahead of 6th district runoff
8 politics.txt -- aclu says demanding us citizens unlock phones at the border is unconstitutional
9 politics.txt -- bernie sanders couldn’t stop laughing when trump didn’t realize he was praising australia’s universal healthcare
10
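
The loading loop above can also be wrapped in a small helper. This is only a sketch: the function name `load_labelled_lines`, the `encoding="utf-8"` default, and the choice to skip blank lines are assumptions that the original cell does not make. Unlike the original loop, it iterates over the keys of the label map instead of `os.listdir`, so a missing file raises an error rather than being silently ignored.

```python
import os


def load_labelled_lines(data_folder, labels_to_int, encoding="utf-8"):
    """Return (corpus, labels): one lower-cased string and one int label per non-empty line."""
    corpus, labels = [], []
    # Sort the file names so the corpus order does not depend on the file system.
    for file_name in sorted(labels_to_int):
        path = os.path.join(data_folder, file_name)
        # The encoding is an assumption; adjust it if the files are stored differently.
        with open(path, encoding=encoding) as file_desc:
            for line in file_desc:
                text = line.strip().lower()
                if not text:  # skip blank lines instead of adding empty tweets
                    continue
                corpus.append(text)
                labels.append(labels_to_int[file_name])
    return corpus, labels
```

With the mapping defined earlier, the call would be `corpus, labels = load_labelled_lines(data_folder, labels_to_int)`.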

Feed the corpus to the TF-IDF vectorizer to produce a sparse matrix with one row per tweet and one column per n-gram.

In [104]:
from sklearn.feature_extraction.text import TfidfVectorizer
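# min_df=1 keeps every term that occurs in at least one tweet; recent scikit-learn
# releases reject an integer min_df below 1.
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df=1, stop_words='english')

tfidf_matrix = tf.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out()
# (available since 1.0) replaces it.
feature_names = tf.get_feature_names_out()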
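
To make the weighting less of a black box, here is a pure-Python sketch of the arithmetic, assuming the scikit-learn defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows). It is a simplification: it splits on whitespace, uses unigrams only, and removes no stop words, so its numbers will not match the cell above. The function name `tfidf_rows` and the three example sentences are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalised rows, on whitespace tokens."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocab]
        # Scale the row to unit Euclidean length (guard against an empty document).
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


# Invented example documents, not taken from the data set.
docs = ["senate passes health bill", "team wins cup final", "senate rejects bill"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term:8s} {weight:.3f}")
```

In the first document, "health" and "passes" receive a higher weight than "senate" and "bill", because the latter two also appear in the third document.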

Reserve 5% of the tweets as a test sample, and use the remaining 95% to train the random forest.

In [105]:
# We use the train_test_split function to shuffle and split the TF-IDF matrix and
# the labels vector, keeping the same "shuffling order" between the two arrays.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, labels, test_size=0.05)
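
The idea of keeping the same shuffling order between the two arrays can be sketched in plain Python. This is not how scikit-learn implements it; the helper name `shuffle_split`, the fixed `seed`, and the placeholder tweets are assumptions for illustration. The test-set size is rounded up, which I believe mirrors what `train_test_split` does for a fractional `test_size`, so 5% of 692 tweets gives 35 test tweets and 657 training tweets.

```python
import math
import random


def shuffle_split(features, labels, test_size=0.05, seed=0):
    """Shuffle two parallel sequences with one permutation, then split off a test set."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    # A single permutation of the indices is applied to both sequences,
    # so every tweet stays paired with its own label.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(indices))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Placeholder data with the same size as the corpus (692 items).
demo_tweets = [f"tweet {i}" for i in range(692)]
demo_labels = [i % 2 for i in range(692)]
demo_X_train, demo_X_test, demo_y_train, demo_y_test = shuffle_split(demo_tweets, demo_labels)
print(len(demo_X_train), len(demo_X_test))
```

Fixing the seed makes the split repeatable; `train_test_split` offers a `random_state` parameter for the same purpose, which the cell above leaves unset, so its split changes on every run.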

Now train the random forest on the TF-IDF matrix.

In [132]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2)

clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

Test the trained random forest classifier on the held-out test data; for a classifier, `score` returns the mean accuracy:

In [133]:
clf.score(X_test, y_test)

0.62857142857142856
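
To put the score in context: it is consistent with 22 correct predictions out of 35 test tweets (5% of 692, rounded up). The sketch below computes accuracy by hand and compares it with a majority-class baseline. The label vectors are invented for illustration only; the real `y_test` and predictions are not shown in this notebook, and the helper names `accuracy` and `majority_baseline` are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(Counter(y_true).values()) / len(y_true)


# Invented labels: 35 test tweets, 13 of them misclassified (22 correct).
example_true = [0] * 18 + [1] * 17
example_pred = [1 - t for t in example_true[:13]] + example_true[13:]

print(f"accuracy: {accuracy(example_true, example_pred):.4f}")
print(f"baseline: {majority_baseline(example_true):.4f}")
```

With a test set this small, one tweet more or less moves the score by almost three percentage points, so comparing against the baseline on the actual `y_test` is a useful sanity check.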