# Lab 2: sport vs politics

**Requirement:**

Build a system that:

1. Every day, collects a large set of random tweets and groups them into tweets about politics and tweets about sport
2. For each of the two groups, shows the main topics of discussion

## Data Collection

Tweepy is a Python library for accessing the Twitter API. Here it is used to extract the text of the tweets that make up our dataset.

First we authenticate in order to use the API.

In [1]:
import tweepy

bearer_token = "your bearer token" # Twitter API
client = tweepy.Client(bearer_token=bearer_token) # OAuth2.0 Version

Three requests are made to get tweets for the following topics:
- sport
- politics
- random topic.

In [2]:
import pandas as pd

limit = 1000 # nr. tweets

# request sports tweets
query_sports = 'context:47.10050757844 -is:retweet lang:en'
tweets_sports = tweepy.Paginator(client.search_recent_tweets, query=query_sports,
                              tweet_fields=['text'], max_results=100).flatten(limit=limit)

df_tweets_sports = pd.DataFrame([tweet.text for tweet in tweets_sports]) # convert to pandas df

# request politics tweets
query_politics = 'context:131.1291447199595782144 -is:retweet lang:en'
tweets_politics = tweepy.Paginator(client.search_recent_tweets, query=query_politics,
                              tweet_fields=['text'], max_results=100).flatten(limit=limit)

df_tweets_politics = pd.DataFrame([tweet.text for tweet in tweets_politics]) # convert to pandas df

# request random tweets
query_random = 'a lang:en' # fix this
tweets_random = tweepy.Paginator(client.search_recent_tweets, query=query_random,
                              tweet_fields=['text'], max_results=100).flatten(limit=limit)

df_tweets_random = pd.DataFrame([tweet.text for tweet in tweets_random]) # convert to pandas df



The three groups serve as class labels, so the data can later be used for supervised learning.

Then all the collected data is concatenated into a single dataset and saved to a .csv file.

In [3]:
# add label column
df_tweets_sports['y'] = 0
df_tweets_politics['y'] = 1
df_tweets_random['y'] = 2


# concatenate data by rows
df_D = pd.concat([df_tweets_politics, df_tweets_sports, df_tweets_random], axis = 0)
df_D = df_D.rename({0: 'X',}, axis=1) # rename features data to X
display(df_D)

# export to csv
df_D.to_csv("data/dataset.csv", encoding='utf-8')

Unnamed: 0,X,y
0,@NGB2020 @LivingWillie @KariLake @katiehobbs E...,1
1,What’s going on in the #arizonaelections is di...,1
2,Have been spoon-fed some delightful bits of th...,1
3,@RepJayapal That still wouldn't stop the crime...,1
4,@HawleyMO Thank you for supporting the Branson...,1
...,...,...
995,RT @dirmaxdeyforU: A Thread of Hilarious Photo...,2
996,RT @labtech666: First test run of a salad bowl...,2
997,RT @AdoptionsUk: Please retweet to help Indy f...,2
998,RT @dollkura: ☆ gcash giveaway ! because it’s ...,2


## Pre-processing

Collected data is loaded from the .csv file.

The dataset consists of tweets labeled by topic:
- y=0 are sport tweets
- y=1 are politics tweets
- y=2 are random tweets

In [4]:
# load dataset
df_D = pd.read_csv("data/dataset.csv")
display(df_D)

Unnamed: 0.1,Unnamed: 0,X,y
0,0,@NGB2020 @LivingWillie @KariLake @katiehobbs E...,1
1,1,What’s going on in the #arizonaelections is di...,1
2,2,Have been spoon-fed some delightful bits of th...,1
3,3,@RepJayapal That still wouldn't stop the crime...,1
4,4,@HawleyMO Thank you for supporting the Branson...,1
...,...,...,...
2995,995,RT @dirmaxdeyforU: A Thread of Hilarious Photo...,2
2996,996,RT @labtech666: First test run of a salad bowl...,2
2997,997,RT @AdoptionsUk: Please retweet to help Indy f...,2
2998,998,RT @dollkura: ☆ gcash giveaway ! because it’s ...,2
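The extra `Unnamed: 0` column in the loaded data comes from the DataFrame index being written to the .csv file and then read back as a regular column. A minimal sketch of one way to avoid it, assuming toy data and an in-memory buffer as hypothetical stand-ins for the real tweets and for `data/dataset.csv`:

```python
import io

import pandas as pd

# toy stand-ins for the labeled DataFrames (not the real tweets)
df_a = pd.DataFrame({"X": ["goal!", "great match"], "y": 0})
df_b = pd.DataFrame({"X": ["election night"], "y": 1})

# ignore_index=True gives the concatenated rows a fresh 0..n-1 index
df_all = pd.concat([df_a, df_b], axis=0, ignore_index=True)

# index=False keeps the index out of the file
buffer = io.StringIO()  # hypothetical replacement for the .csv path
df_all.to_csv(buffer, index=False)
buffer.seek(0)

df_loaded = pd.read_csv(buffer)
print(list(df_loaded.columns))
```

With this variant only the `X` and `y` columns are stored, and the row index is rebuilt when the file is loaded.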


### Bag-of-Words

We use the scikit-learn implementation of bag-of-words, the CountVectorizer class.
It takes an array of texts as input and returns a bag-of-words model.

#### Less frequent words
To reduce the dimensionality we can drop the words that appear less frequently. The "min_df" parameter sets the minimum number of documents a word needs to appear in to be kept in the vocabulary.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(min_df=5)
count.fit(df_D.loc[:,'X']) # generate Bag-of-words model

print("Vocabulary size: {}". format(len(count.vocabulary_)))
#print("Vocabulary content:\n {}".format(count.vocabulary_))

Vocabulary size: 1646
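To make the effect of "min_df" concrete, here is a small sketch on a made-up three-document corpus (the sentences are assumptions for illustration, not tweets from the dataset). It compares the vocabulary kept with `min_df=1` and `min_df=2`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus, invented for this example
docs = [
    "the match was great",
    "the election was close",
    "great match today",
]

# min_df=1 keeps every word; min_df=2 keeps only words found in at least 2 documents
vocab_all = sorted(CountVectorizer(min_df=1).fit(docs).vocabulary_)
vocab_min2 = sorted(CountVectorizer(min_df=2).fit(docs).vocabulary_)

print("min_df=1:", vocab_all)
print("min_df=2:", vocab_min2)
```

Words such as "election" or "today", which occur in a single document, disappear once `min_df=2` is used.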


### Stemming
We can further improve our bag-of-words pre-processing with a normalization technique called stemming.
The idea is to reduce each word to its stem using a rule-based heuristic (the stemming algorithm).
For example, a stemmer reduces words like "climbs", "climbed" and "climbing" to "climb".

The Natural Language Toolkit for Python (NLTK, http://www.nltk.org) implements the Snowball stemming algorithm.

In [6]:
# Apply advanced tokenization
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# tokenizer that splits the text on whitespace and stems each token
def tokenizer_snowballStemmer(text):
    return [stemmer.stem(word) for word in text.split()]

tokenizer_snowballStemmer("The pink sweater fit her perfectly") # test

count = CountVectorizer(tokenizer = tokenizer_snowballStemmer, min_df=5) # use tokenizer function
count.fit(df_D.loc[:,'X']) # generate Bag-of-words

print("Vocabulary size using stemming: {}". format(len(count.vocabulary_)))
#print("Vocabulary content:\n {}".format(count.vocabulary_))



Vocabulary size using stemming: 1610
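Splitting on whitespace leaves punctuation attached to the tokens, so "great" and "great!" can end up as separate vocabulary entries. A possible variant, sketched here as an assumption rather than as part of the lab's pipeline, extracts words with a regular expression before stemming. The name `tokenizer_regex_stemmer` and the pattern are hypothetical choices for illustration:

```python
import re

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# words of two or more characters; punctuation is dropped
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenizer_regex_stemmer(text):
    """Hypothetical tokenizer: regex word extraction followed by stemming."""
    return [stemmer.stem(token) for token in TOKEN_RE.findall(text)]

print(tokenizer_regex_stemmer("Climbing, climbed!"))
```

Such a function could be passed to `CountVectorizer` through the `tokenizer` parameter in the same way as `tokenizer_snowballStemmer`.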


## Modeling

Shuffle the data and split it into a training set and a test set.

In [7]:
from sklearn.model_selection import train_test_split

# split data
X_train, X_test, y_train, y_test = train_test_split(
    df_D.loc[:,'X'], df_D.loc[:,'y'], test_size=0.3, random_state=0, shuffle=True)

Then apply the previous text pre-processing steps. The bag-of-words vocabulary is fitted on the training set only, and the same transformation is then applied to both the training set and the test set.

In [8]:
# fit the bag-of-words model on the training set
count = CountVectorizer(tokenizer=tokenizer_snowballStemmer, min_df=5).fit(X_train)

# apply transformation to the data
X_train_bow = count.transform(X_train)
print("X_train: {}".format(X_train_bow.shape))

X_test_bow = count.transform(X_test)
print("X_test: {}".format(X_test_bow.shape))

X_train: (2100, 1162)
X_test: (900, 1162)


Fit a Random Forest model on the Training set.

In [9]:
from sklearn.ensemble import RandomForestClassifier

# fit a Random Forest model
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train_bow, y_train)

Check the model performance using cross validation (on the training set) and show the accuracy results.

In [10]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

# print CV results
scores = cross_val_score(clf, X_train_bow, y_train, cv=5, scoring='accuracy')
display("Cross Validation scores")
i=1
for a in scores:
    display("Accuracy cv=" + str(i) + ": " + str(round(a*100, 2)))
    i = i+1

display("Accuracy MEAN: "+str(round(scores.mean(),2)*100))

'Cross Validation scores'

'Accuracy cv=1: 92.38'

'Accuracy cv=2: 92.38'

'Accuracy cv=3: 90.48'

'Accuracy cv=4: 93.1'

'Accuracy cv=5: 90.95'

'Accuracy MEAN: 92.0'
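In the cross-validation above, the vocabulary was fitted on the whole training set beforehand, so each validation fold has already contributed to the `min_df` filtering. One possible way to keep the folds fully separate is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that both are refitted inside every fold. The sketch below uses invented toy texts and labels, with `min_df=1` and `cv=3` as assumptions that suit such a small corpus:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# toy data, invented for this example: 0 = sport, 1 = politics
texts = [
    "great match and goal", "team wins the match", "goal in the final",
    "coach praises team", "final score of the game", "player scores a goal",
    "election results are in", "senate passes the vote", "president signs the bill",
    "vote count continues", "governor wins election", "bill goes to the senate",
]
labels = [0] * 6 + [1] * 6

# the vectorizer is refitted on the training part of each fold
pipeline = make_pipeline(
    CountVectorizer(min_df=1),
    RandomForestClassifier(n_estimators=10, random_state=0),
)

scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
print("Mean CV accuracy: {:.2f}".format(scores.mean() * 100))
```

The same pipeline object can afterwards be fitted on the raw training texts and used to predict directly from raw test texts.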

Now the model is tested on the test set. The accuracy and the confusion matrix are then displayed.

In [11]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# results on test set
predictions = clf.predict(X_test_bow)
test_accuracy = accuracy_score(predictions, y_test)
display("Test set accuracy: " + str(round(test_accuracy*100, 2)))

display(confusion_matrix(predictions, y_test)) # confusion matrix; with this argument order rows are predicted labels and columns are true labels

'Test set accuracy: 92.78'

array([[284,  11,   5],
       [  9, 287,   8],
       [  5,  27, 264]])
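Because the matrix was computed with `confusion_matrix(predictions, y_test)`, its rows correspond to predicted labels and its columns to true labels. Under that reading, per-class precision and recall can be derived from the numbers above. This is a small sketch in plain Python, with the matrix copied by hand from the output and the class names taken from the label definitions (0 = sport, 1 = politics, 2 = random):

```python
# matrix copied from the output above: rows = predicted label, columns = true label
cm = [
    [284, 11, 5],
    [9, 287, 8],
    [5, 27, 264],
]
names = ["sport", "politics", "random"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print("accuracy: {:.2f}".format(accuracy * 100))

for i, name in enumerate(names):
    predicted_as_i = sum(cm[i])             # row sum: tweets predicted as class i
    truly_i = sum(row[i] for row in cm)     # column sum: tweets whose true class is i
    precision = cm[i][i] / predicted_as_i
    recall = cm[i][i] / truly_i
    print("{}: precision={:.3f} recall={:.3f}".format(name, precision, recall))
```

The largest off-diagonal entry is 27: politics tweets that the model predicted as random.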

## Pre-processing (point 2)
