# Basic introduction to scikit-learn: Analyzing tweets

## A. Connecting to the Twitter API and getting Twitter credentials:

1. Create a Twitter account on https://twitter.com.
2. Go to https://developer.twitter.com/en/apps and log in with your Twitter account.
3. Click “Create an App” and fill in the details of the application (see Minerva for a more detailed explanation).
4. The application’s tokens and keys are available in the “Keys and Access Tokens” tab.


To gather Tweets, we will use the [Tweepy](https://github.com/tweepy/tweepy) library. To install Tweepy, open the command prompt as an administrator and run the following pip command: **pip install tweepy**.

In [1]:
!pip install tweepy



In [0]:
import tweepy

# Consumer keys and access tokens, used for OAuth
consumer_key = 'removed keys for security reasons'
consumer_secret = 'removed keys for security reasons'
access_token = 'removed keys for security reasons'
access_token_secret = 'removed keys for security reasons'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Calling the api
api = tweepy.API(auth)
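Since the keys above were removed for security reasons, one possible way to keep them out of the notebook altogether is to read them from environment variables. The following is only a minimal sketch: the helper `load_twitter_credentials` and the variable names `TWITTER_CONSUMER_KEY` etc. are hypothetical choices made for this illustration, not something Tweepy or Twitter prescribes.

```python
import os

# Hypothetical environment variable names, chosen for this illustration only.
CREDENTIAL_NAMES = [
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
]

def load_twitter_credentials(env=os.environ):
    """Return the four credentials from a mapping, failing loudly if one is missing."""
    missing = [name for name in CREDENTIAL_NAMES if name not in env]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_NAMES}

# Possible usage inside the notebook (left as a comment so it does not fail when unset):
# creds = load_twitter_credentials()
# consumer_key = creds["TWITTER_CONSUMER_KEY"]
```

The returned values can then be assigned to consumer_key, consumer_secret, access_token and access_token_secret as in the cell above.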

Now we can interact with the API. For this lab, we will deal with gathering and analyzing tweets from certain Twitter accounts.
Statuses posted by a specified user can be collected with a [GET statuses/user_timeline](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline) request.
In Tweepy, this corresponds to the user_timeline method.
Have a look at the [syntax](http://docs.tweepy.org/en/v3.5.0/api.html) and possibilities.
Let's take a look at the latest 5 Tweets by @Bart_DeWever (a Belgian politician): 

In [3]:
handle = '@Bart_DeWever'
number_of_tweets = 5
tweets = api.user_timeline(screen_name=handle, count=number_of_tweets)

print(type(tweets))
print(type(tweets[0]))

<class 'tweepy.models.ResultSet'>
<class 'tweepy.models.Status'>


Note that tweets is a list-like ResultSet of Tweepy **'Status' objects**. Take a look [here](https://gist.github.com/jaymcgrath/367c521f1dd786bc5a05ec3eeeb1cb04) to see all the attributes and subattributes of the object.
For example, we can get the text and the author's name of the latest tweet by using the following commands:

In [4]:
print(tweets[0].text)
print(tweets[0].author.name)

In Antwerpen nam onze politie de eerste tegen woekerprijs aangeboden mondmaskers in beslag en bracht die naar een z… https://t.co/SdgmO082pE
Bart De Wever


**Task**: Find the author's location. How many retweets did his latest post have? Check his Twitter page to verify your answer. 

In [5]:
location = tweets[0].author.location
retweets = tweets[0].retweet_count

print('The author\'s location is '+location)
print('His latest post has '+repr(retweets)+' retweets')

The author's location is Antwerpen
His latest post has 218 retweets


**Retweets** can be excluded by adding the parameter '**include_rts = False**' to the user_timeline command. Note, however, that retweets still count towards the count parameter before they are filtered out, so usually len(tweets) < number_of_tweets.
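To make the effect of count concrete, here is a small simulation in plain Python. It does not call the Twitter API: the list `timeline` and the helper `simulated_user_timeline` are made up for this illustration and only mimic the behaviour described above (take count posts first, then drop the retweets).

```python
# Hypothetical stand-in for a user's timeline, newest post first.
timeline = [
    {"text": "original 1", "is_retweet": False},
    {"text": "RT someone", "is_retweet": True},
    {"text": "original 2", "is_retweet": False},
    {"text": "RT someone else", "is_retweet": True},
    {"text": "original 3", "is_retweet": False},
]

def simulated_user_timeline(timeline, count, include_rts=True):
    """Mimic the described behaviour: select `count` posts, then filter out retweets."""
    window = timeline[:count]
    if include_rts:
        return window
    return [post for post in window if not post["is_retweet"]]

originals = simulated_user_timeline(timeline, count=4, include_rts=False)
print(len(originals))  # 2, although count was 4
```

If a fixed number of original Tweets is needed, one option is to request more posts than required and keep only the first few after filtering.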

**Task**: Out of his latest 10 posts, print all his original Tweets (= a Tweet that is not a retweet).

In [6]:
number_of_tweets = 10
tweets = api.user_timeline(screen_name=handle, count=number_of_tweets, include_rts=False)
for i, tweet in enumerate(tweets):
  print(f'{i}\t: {tweet.text}')

0	: In Antwerpen nam onze politie de eerste tegen woekerprijs aangeboden mondmaskers in beslag en bracht die naar een z… https://t.co/SdgmO082pE
1	: Veel steun aan wie dezer dagen werkt in de zorg of distributie, in winkels of supermarkten, en ook aan hulpdiensten… https://t.co/Ddpz0rjvUM
2	: Dit zijn de belangrijkste standpunten die wij verdedigen aan de onderhandelingstafel: 👇
https://t.co/jl1jzH6tQv
3	: We nemen akte van het PS-dictaat. Tegen de wil van de Vlaamse kiezer in moet er voor hen een zo links mogelijke reg… https://t.co/8XfejN5f7D
4	: In de ranking van @FT schuift Antwerpen naar de 2de plaats in de lijst van de beste grote steden om te investeren.… https://t.co/ZXoDZFTzA3
5	: Het unitarisme dat @GLBouchez voorstelt werpt ons terug naar de 19de eeuw. Iedereen heeft recht op een eigen opinie… https://t.co/SWRYjE7gBH
6	: Nooit mag terechte onvrede over het Europees migratiebeleid zich omzetten in wrok naar mensen. In de Vlaamse natie… https://t.co/HNcNkXyzFU
7	: Met een dier

## B. scikit-learn and Machine Learning in practice: an introduction
The **problem** is as follows: we want to analyze the latest tweets by the leaders of the Belgian political parties. Based on their choice of words, the topics they discuss, the hashtags they use, etc., we would like to see whether we can predict which politician wrote a given Tweet. __Text classification__ is a common Machine Learning task, so we expect our problem to have a reasonably good solution. Indeed, we can imagine that a right-wing politician and a left-wing/green politician tweet differently about ongoing events or political topics.

First, install the following packages: **pip install scikit-learn**, **pip install nltk** and **pip install numpy**

### 1. Bag-of-words Model
In our model, each datapoint consists of the text of a Tweet and an associated label (= the author's name). However, in order to run a generic black-box classifier, we need to convert this data to numeric values.
The **labels** can easily be converted by assigning an integer value to each author. The text can be converted to numeric **features** by using the well-known [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model. Basically, we keep track of all the unique words in all the Tweets, and for each unique word, we count how many times it occurs in each Tweet.
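As a rough illustration of the idea, the counting can be done by hand on a tiny corpus. This is only a sketch in plain Python: the three sentences in `toy_corpus` and the helper `bag_of_words` are invented for this example, and the simple whitespace split is an assumption that differs from the tokenization a real vectorizer applies.

```python
from collections import Counter

# Made-up mini corpus; each string plays the role of one Tweet.
toy_corpus = ["de stad is mooi", "de stad de haven", "mooi mooi weer"]

# The vocabulary is the sorted set of all unique words in the corpus.
vocabulary = sorted({word for doc in toy_corpus for word in doc.split()})

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One row per document, one column per vocabulary word.
matrix = [bag_of_words(doc, vocabulary) for doc in toy_corpus]

print(vocabulary)  # ['de', 'haven', 'is', 'mooi', 'stad', 'weer']
for row in matrix:
    print(row)
# [1, 0, 1, 1, 1, 0]
# [2, 1, 0, 0, 1, 0]
# [0, 0, 0, 2, 0, 1]
```

Each row is the numeric feature vector of one document; word order is lost, which is why the model is called a "bag" of words.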

In scikit-learn, this is done by a **vectorizer**, see Section 4.2.3 [here](https://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation). Let's apply this to our corpus of Tweets.

In [7]:
!pip install scikit-learn nltk numpy



In [8]:
# Importing packages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import re
import numpy as np
import nltk

# Twitter handles of the politicians.
handles = ['@Bart_DeWever', '@conner_rousseau', '@RuttenGwendolyn', '@MeyremAlmaci', '@tomvangrieken']
number_of_tweets = 200
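corpus = [] # a list of the text in all the tweets from the handles. 
labels = [] # a list with the numeric label of each tweet (1='@Bart_DeWever', 2='@conner_rousseau', etc.)
teller = 1 # counter ('teller' is Dutch for counter) holding the numeric label of the current handle.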

# Getting the Tweets from the handles, excluding Retweets and getting full_text from the tweets.
for name in handles:
    tweets = api.user_timeline(screen_name=name, count=number_of_tweets, include_rts=False, tweet_mode="extended")
    corpus = corpus + [t.full_text for t in tweets] # getting the full text tweets.
    labels = labels + [teller] * len(tweets) # one label per collected tweet.
    teller += 1

print(corpus[0:5]) # Printing the first tweets in the corpus. 

['In Antwerpen nam onze politie de eerste tegen woekerprijs aangeboden mondmaskers in beslag en bracht die naar een ziekenhuis. \n👉 Ziet u zoiets? \n📞 Meld het aub aan de politie @PZAntwerpen ! 👮🏻\u200d♀️👮🏻\u200d♂️', 'Veel steun aan wie dezer dagen werkt in de zorg of distributie, in winkels of supermarkten, en ook aan hulpdiensten, leerkrachten enz. \n\nDraag zorg voor elkaar en denk aan het klavertjevier: was je handen, hou veilige afstand, blijf thuis en nies in je zakdoek of elleboog. 🍀 https://t.co/LGPH43twNr', 'Dit zijn de belangrijkste standpunten die wij verdedigen aan de onderhandelingstafel: 👇\nhttps://t.co/jl1jzH6tQv', 'We nemen akte van het PS-dictaat. Tegen de wil van de Vlaamse kiezer in moet er voor hen een zo links mogelijke regering komen, zonder meerderheid in Vlaanderen. De PS wil zelfs verkiezingen uitlokken, enkel en alleen in de hoop om de N-VA te verzwakken en zichzelf te versterken. https://t.co/U6UaoxeZk4', 'In de ranking van @FT schuift Antwerpen naar de 2de p

In [9]:
# Now we convert this corpus to a numeric matrix X:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

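# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vectorizer.get_feature_names_out().tolist()) # Checking the words that are vectorized, and thus part of the model.
print(X.shape) # Checking the dimension of the matrix.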

['000', '03', '04', '0nl9mfedfi', '10', '100', '1000', '1000den', '1099ehhglt', '10n', '11november', '12', '125', '13', '135', '13de', '14', '150', '16fdrwb8wl', '17', '1700', '172', '1813', '18jr', '1944', '1948', '1949', '19de', '1c1gjhoylk', '1e', '1kuwwqcgdv', '1lkbfdgrmh', '1n3zzc0kge', '1xw2h1o7zm', '2000', '2002', '2008', '2010', '2017', '2018', '2019', '2020', '2030', '2060', '21', '21ste', '22', '22maart', '22maart2016', '23zs6rw7cy', '240', '27', '270', '28', '28jxr2pmxt', '2avgikk9vm', '2de', '2e', '2ga2ds0ce6', '2npy5muogv', '2vglmyuj9j', '2xwar5sgk8', '30u', '31', '35', '3d4x7hm1af', '3iwbd2afte', '3qmgoyiecb', '3shmlqnps2', '3tgtlw9ecw', '41vccnukvl', '441', '48fodg1ste', '4ham2bhf9m', '4yygc0j0dg', '50', '500', '51zhzcbhuk', '53edfundoi', '58', '5a7cvpykcn', '5dqxeiag2j', '5fimrtaf25', '5jr', '5nax4zzbyg', '5ninz8l7ka', '5qqzuhk6y4', '5s7jf5dw4d', '5shluijqaz', '5w33yiwpjc', '60', '62jvrivdw5', '63', '65000', '65vlvdhepr', '6bcbtl3d7q', '6evosc7vgl', '6kahrovxvu', '6qkhx

### 2. Applying generic Machine Learning classifiers
Given our datapoints (the rows in X) and the labels, we are ready to run some basic classifiers. First, in order to evaluate our models properly, we need to split our data into **training data** and **test data**.

**TASK**: use the 'train_test_split' function to split the data automatically. Make sure this is done in a **random** manner.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, labels, shuffle=True, random_state=2020)
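To see what a shuffled split does conceptually, here is a simplified sketch in plain Python. It is not scikit-learn's implementation: the helper `simple_split` is hypothetical, and the only detail borrowed from the library is that, when no sizes are given, train_test_split holds out 25% of the data as test data.

```python
import random

def simple_split(rows, labels, test_fraction=0.25, seed=2020):
    """Shuffle the indices once, then cut them into a test part and a train part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split reproducible
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    rows_train = [rows[i] for i in train_idx]
    rows_test = [rows[i] for i in test_idx]
    labels_train = [labels[i] for i in train_idx]
    labels_test = [labels[i] for i in test_idx]
    return rows_train, rows_test, labels_train, labels_test

# Toy data: 8 "tweets" with their labels.
toy_rows = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
toy_labels = [1, 1, 1, 1, 2, 2, 2, 2]
r_train, r_test, l_train, l_test = simple_split(toy_rows, toy_labels)
print(len(r_train), len(r_test))  # 6 2
```

Because rows and labels are selected with the same shuffled indices, every datapoint keeps its own label on both sides of the split.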

Now we can train our models on the training data and evaluate them on the test data. We will start by applying a [Decision Tree Classifier](https://scikit-learn.org/stable/modules/tree.html)  and a [Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html). As a simple measure of **accuracy**, we count the number of times the classifier correctly predicted the label on the test data.

In [11]:
from sklearn import tree
from sklearn.naive_bayes import MultinomialNB

# Decision Tree Classifier
model_tree = tree.DecisionTreeClassifier(random_state=1) # Setting random_state = 1, to remove randomness in fitting.
model_tree = model_tree.fit(X_train, y_train)
y_predict = model_tree.predict(X_test)
accuracy_tree = np.sum(y_predict == y_test)
print('Decision Tree accuracy:'+'\t'+repr(accuracy_tree))

# Naive Bayes Classifier:
model_bayes = MultinomialNB()
model_bayes = model_bayes.fit(X_train, y_train)
y_predict = model_bayes.predict(X_test)
accuracy_bayes = np.sum(y_predict == y_test)
print('Naive Bayes accuracy:'+'\t'+repr(accuracy_bayes))

Decision Tree accuracy:	63
Naive Bayes accuracy:	77
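The values printed above are raw counts of correct predictions, so they only become meaningful relative to the size of the test set. A minimal sketch of how a count can be turned into a proportion is shown below; the helper `accuracy_report` and the toy label lists are made up for this illustration (scikit-learn also offers sklearn.metrics.accuracy_score for the proportion).

```python
def accuracy_report(y_true, y_pred):
    """Return (number of correct predictions, fraction of correct predictions)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct, correct / len(y_true)

# Toy labels, not taken from the tweet data.
toy_true = [1, 2, 3, 1, 2]
toy_pred = [1, 2, 1, 1, 3]
count, fraction = accuracy_report(toy_true, toy_pred)
print(count, fraction)  # 3 0.6
```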


Now that we have a model, we will try to improve it. The first thing we can try is to **redefine our corpus**.
There are two problems with our current corpus:

   a. There are many tweets with **urls** in them, which are clearly not informative for the classification.
   
   b. We have not removed the **stopwords** from the corpus (= very common words without much meaning).

**TASK**: Make a new corpus, where the urls + Dutch stopwords are removed. Test the difference in performance of the classifiers. Explain why the performance, at first sight, might not have improved.



In [12]:
# Download the Dutch stop words from the NLTK repository.
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
from nltk.corpus import stopwords

## Removing the links from the tweets (starting with https:// until a space)
# HINT: iterate over the corpus, and use the re.sub function to replace 'https' pieces with an empty string ''.
# NOTE: the pattern requires a trailing space, so a URL at the very end of a tweet or right before a newline is kept.
corpus_no_url = [re.sub(r'https://\S* ', '', tweet) for tweet in corpus]
    
## Removing stopwords from Dutch language
# HINT: Split the words in a tweet, only keep words that are not in stopWords, then join the separate words into a string.
# NOTE: the comparison is case-sensitive, so capitalized stopwords such as 'In' or 'De' are kept.
stopWords = set(stopwords.words('dutch'))
corpus_no_stops_no_url = [' '.join([token for token in tweet.split(' ') if token not in stopWords]) for tweet in corpus_no_url]

print(corpus_no_stops_no_url[0:5])

['In Antwerpen nam onze politie eerste woekerprijs aangeboden mondmaskers beslag bracht ziekenhuis. \n👉 Ziet zoiets? \n📞 Meld aub politie @PZAntwerpen ! 👮🏻\u200d♀️👮🏻\u200d♂️', 'Veel steun dezer dagen werkt zorg distributie, winkels supermarkten, hulpdiensten, leerkrachten enz. \n\nDraag zorg elkaar denk klavertjevier: handen, hou veilige afstand, blijf thuis nies zakdoek elleboog. 🍀 https://t.co/LGPH43twNr', 'Dit belangrijkste standpunten wij verdedigen onderhandelingstafel: 👇\nhttps://t.co/jl1jzH6tQv', 'We nemen akte PS-dictaat. Tegen Vlaamse kiezer hen links mogelijke regering komen, meerderheid Vlaanderen. De PS zelfs verkiezingen uitlokken, enkel alleen hoop N-VA verzwakken zichzelf versterken. https://t.co/U6UaoxeZk4', 'In ranking @FT schuift Antwerpen 2de plaats lijst beste grote steden investeren. De Vlaamse welvaartsmotor draait volle toeren. Wat we doen, we echt beter. Investeringen recordhoogte, optimistisch onze toekomst. https://t.co/Acpg3CfIkx']
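The printed corpus still contains URLs at the end of tweets and capitalized stopwords such as 'In' and 'De'. A stricter cleaning step could look like the sketch below. It is an alternative to the cell above, not the method used there: the helper `clean_tweet` is hypothetical, and `toy_stopwords` is a tiny hand-picked set standing in for the NLTK Dutch list so that the example runs without downloads.

```python
import re

# Matches http:// and https:// links up to the next whitespace, also at the end of the text.
URL_PATTERN = re.compile(r"https?://\S+")

# Hand-picked stand-in for stopwords.words('dutch'); an assumption for this sketch.
toy_stopwords = {"de", "het", "een", "in", "en", "van"}

def clean_tweet(text, stopwords):
    """Remove URLs, then drop stopwords case-insensitively."""
    without_urls = URL_PATTERN.sub("", text)
    tokens = without_urls.split()  # splits on any whitespace, including newlines
    kept = [token for token in tokens if token.lower() not in stopwords]
    return " ".join(kept)

print(clean_tweet("In de stad en de haven https://t.co/abc123", toy_stopwords))
# stad haven
```

Note that split() without arguments also collapses newlines into single spaces, which changes the layout of the text but not the word counts a vectorizer would produce.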


In [14]:
# Performance testing with the new corpus
# Now we convert this corpus to a numeric matrix X:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus_no_stops_no_url)

X_train, X_test, y_train, y_test = train_test_split(X, labels, shuffle=True, random_state=2020)

# Decision Tree Classifier
model_tree = tree.DecisionTreeClassifier(random_state=1) # Setting random_state = 1, to remove randomness in fitting.
model_tree = model_tree.fit(X_train, y_train)
y_predict = model_tree.predict(X_test)
accuracy_tree = np.sum(y_predict == y_test)
print('Decision Tree accuracy:'+'\t'+repr(accuracy_tree))

# Naive Bayes Classifier:
model_bayes = MultinomialNB()
model_bayes = model_bayes.fit(X_train, y_train)
y_predict = model_bayes.predict(X_test)
accuracy_bayes = np.sum(y_predict == y_test)
print('Naive Bayes accuracy:'+'\t'+repr(accuracy_bayes))

Decision Tree accuracy:	71
Naive Bayes accuracy:	86


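**Give a couple of reasons why the accuracy might not be higher than with the original corpus:**

Contrary to what the question anticipates, the accuracy of both the Decision Tree and the Naive Bayes classifier went up. Because the split uses the same random_state and the same number of tweets, the same tweets end up in the test set in both runs, so the counts are directly comparable.
For the Decision Tree classifier, the number of correct predictions increased from 63 to 71.
For the Naive Bayes classifier, it increased from 77 to 86.

By removing the URLs and stopwords, a lot of clutter is removed from the dataset.
These did not contribute towards a correct classification and even 'distracted' the classifier from important information.
Now the classifiers can focus on words or word groups that really matter.

The gain could still be limited because the cleaning is incomplete: URLs at the end of a tweet or right before a newline are not matched by the pattern (several are still visible in the printed corpus), and capitalized stopwords such as 'In' and 'De' survive the case-sensitive filter.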