# Basic introduction to scikit-learn: Analyzing Tweets

## A. Connecting to the Twitter API and getting Twitter credentials

1. Create a Twitter account on https://twitter.com.
2. Go to https://developer.twitter.com/en/apps and log in with your Twitter account.
3. Click “Create an App” and fill in details of the application (See Minerva for a more detailed explanation).
4. The application’s tokens and keys are available in the “Keys and Access Tokens” tab.


To gather Tweets, we will use the [Tweepy](https://github.com/tweepy/tweepy) library. To install Tweepy, open the command prompt as an administrator and run the following pip command: **pip install tweepy**.

In [None]:
import tweepy

# Consumer keys and access tokens, used for OAuth
consumer_key = <FILL_IN_AS_STRING>          
consumer_secret = <FILL_IN_AS_STRING>    
access_token = <FILL_IN_AS_STRING>            
access_token_secret = <FILL_IN_AS_STRING>  

auth = tweepy.OAuthHandler(consumer_key, consumer_secret) 
auth.set_access_token(access_token, access_token_secret)

# Calling the api
api = tweepy.API(auth) 

Now we can interact with the API. For this lab, we will deal with gathering and analyzing tweets from certain Twitter accounts.
Statuses posted by a specified user can be collected with a [GET statuses/user_timeline](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline) request.
In Tweepy, this request corresponds to the user_timeline method.
Have a look at the [syntax](http://docs.tweepy.org/en/v3.5.0/api.html) and possibilities.
Let's take a look at the latest 5 Tweets by @Bart_DeWever (a Belgian politician): 

In [None]:
handle = '@Bart_DeWever'
number_of_tweets = 5
tweets = api.user_timeline(screen_name = handle, count = number_of_tweets)

print(type(tweets))
print(type(tweets[0]))

Note that tweets is a list-like collection (a Tweepy ResultSet) of Tweepy **'Status' objects**. Take a look [here](https://gist.github.com/jaymcgrath/367c521f1dd786bc5a05ec3eeeb1cb04) to see all the attributes and subattributes of the object.
For example, we can get the text and author's name of the latest tweet by using the following commands:

In [None]:
print(tweets[0].text)
print(tweets[0].author.name)

**Task**: Find the author's location. How many retweets did his latest post have? Check his Twitter page to verify your answer. 

In [None]:
location = <FILL_IN>
retweets = <FILL_IN>

print('The author\'s location is '+location)
print('His latest post has '+repr(retweets)+' retweets')

**Retweets** can be excluded by adding the parameter '**include_rts = False**' to the user_timeline command. Note however that the count parameter still counts retweets, and hence usually len(tweets) < number_of_tweets.
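
The interaction between count and include_rts can be illustrated offline. The sketch below does not call the Twitter API: the `timeline` list and the `simulate_user_timeline` helper are hypothetical, and the code simply assumes the behaviour described above (the count is applied first, retweets are dropped afterwards).

```python
# Hypothetical, hand-made timeline: newest first; True marks a retweet.
timeline = [
    {"text": "original A", "is_retweet": False},
    {"text": "RT something", "is_retweet": True},
    {"text": "original B", "is_retweet": False},
    {"text": "RT something else", "is_retweet": True},
    {"text": "original C", "is_retweet": False},
]


def simulate_user_timeline(timeline, count, include_rts=True):
    """Mimic the described behaviour: cut to `count` first, then drop retweets."""
    window = timeline[:count]
    if include_rts:
        return window
    return [t for t in window if not t["is_retweet"]]


simulated = simulate_user_timeline(timeline, count=4, include_rts=False)
print(len(simulated))  # 2: fewer than the 4 tweets requested
```

Only two of the four requested posts survive, because the other two slots were used up by retweets.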

**Task**: Out of his latest 10 posts, print all his original Tweets (= a Tweet that is not a retweet).

In [None]:
number_of_tweets = 10
#<FILL_IN>

## B. scikit-learn and Machine Learning in practice: an introduction 
The **problem** is as follows: We want to analyze the latest tweets by the leaders of the Belgian political parties. Based on their chosen language, the topics they discuss, the hashtags they use, etc., we would like to see if we can predict who wrote which Tweet. __Text classification__ is a common Machine Learning task, and we expect that our problem has a reasonably good solution. Indeed, we can imagine that a right-wing politician and a left-wing or green politician tweet differently about ongoing events or political topics.

First, install the following packages: **pip install scikit-learn**, **pip install nltk** and **pip install numpy**

### 1. Bag-of-words Model
In our model, each datapoint consists of the text of a Tweet and an associated label (= the Author's name). However, in order to run some generic black-box classifier, we need to convert this data to numeric values.
The **labels** can easily be converted by just assigning an integer value to each Author. The text can be converted to numeric **features** by using the well-known [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model. Basically, we keep track of all the unique words in all the Tweets, and for each unique word, we count how many times it occurs in each Tweet.
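
To make the idea concrete, here is a minimal pure-Python sketch of the label encoding and the word counting. The mini corpus, the author names and all variable names are made up for illustration; this is not how scikit-learn implements it internally, only the same idea written out by hand.

```python
from collections import Counter

# Made-up mini corpus with made-up author names (illustration only).
toy_corpus = ["de stad groeit", "de stad de haven", "klimaat en haven"]
toy_authors = ["A", "A", "B"]

# Labels: one integer per distinct author, in order of first appearance.
author_to_label = {}
for author in toy_authors:
    author_to_label.setdefault(author, len(author_to_label) + 1)
toy_labels = [author_to_label[a] for a in toy_authors]

# Vocabulary: every unique word in the corpus, sorted alphabetically.
vocabulary = sorted({word for text in toy_corpus for word in text.split()})

# Features: one row per text, one column per vocabulary word (word counts).
toy_X = []
for text in toy_corpus:
    counts = Counter(text.split())
    toy_X.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['de', 'en', 'groeit', 'haven', 'klimaat', 'stad']
print(toy_X)       # [[1, 0, 1, 0, 0, 1], [2, 0, 0, 1, 0, 1], [0, 1, 0, 1, 1, 0]]
print(toy_labels)  # [1, 1, 2]
```

Each row of `toy_X` is one datapoint; the second row has a 2 in the first column because "de" occurs twice in the second text.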

In scikit-learn, this is done by a **vectorizer**, see Section 4.2.3 [here](https://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation). Let's apply this to our corpus of Tweets.

In [None]:
# Importing packages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import re
import numpy as np
import nltk

# Twitter handles of the politicians.
handles = ['@Bart_DeWever', '@conner_rousseau', '@RuttenGwendolyn', '@MeyremAlmaci', '@tomvangrieken']
number_of_tweets = 200
corpus = [] # a list of the text in all the tweets from the handles. 
labels = [] # a list with the numeric label of each tweet (1='@Bart_DeWever', 2='@conner_rousseau', etc.)
teller = 1 # label counter ('teller' is Dutch for 'counter'), increased by one for each handle

# Getting the Tweets from the handles, excluding Retweets and getting full_text from the tweets.
for name in handles:
    tweets = api.user_timeline(screen_name = name, count = number_of_tweets, include_rts = False, tweet_mode="extended")
    corpus = corpus + [t.full_text for t in tweets] # getting the full text tweets.
    labels = labels + [teller for i in range(0,len(tweets))]
    teller +=1
    
# print(corpus[0:5]) # Printing the first tweets in the corpus. 

In [None]:
# Now we convert this corpus to a numeric matrix X:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# print(vectorizer.get_feature_names_out()) # Checking the words that are vectorized, and thus part of the model (on scikit-learn versions before 1.0, use get_feature_names()).
# print(X.shape) # Checking the dimension of the matrix.


### 2. Applying generic Machine Learning classifiers
Given our datapoints (the rows in X) and the labels, we are ready to run some basic classifiers. First, in order to evaluate our models properly, we need to split our data into **training data** and **test data**.
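
Randomness matters here because our corpus was built handle by handle, so all tweets of one author sit next to each other; cutting off the last part of the data would put mostly one author in the test set. The sketch below illustrates a random split by hand on made-up data. It is not scikit-learn's function, and all names in it are hypothetical.

```python
import random

# Made-up data: 10 datapoints with made-up labels, ordered by author.
demo_points = [f"tweet {i}" for i in range(10)]
demo_labels = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

# Shuffle the indices with a fixed seed so the split is random but repeatable.
indices = list(range(len(demo_points)))
random.Random(0).shuffle(indices)

# Keep 80% of the shuffled indices for training and the rest for testing.
cut = int(0.8 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

demo_train = [(demo_points[i], demo_labels[i]) for i in train_idx]
demo_test = [(demo_points[i], demo_labels[i]) for i in test_idx]

print(len(demo_train), len(demo_test))  # 8 2
```

Every datapoint ends up in exactly one of the two sets, and fixing the seed makes the experiment reproducible.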

**TASK**: use the 'train_test_split' function to split the data automatically. Make sure this is done in a **random** manner.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = <FILL_IN>



Now we can train our models on the training data and evaluate them on the test data. We will start by applying a [Decision Tree Classifier](https://scikit-learn.org/stable/modules/tree.html) and a [Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html). To check **accuracy**, we compute the fraction of test datapoints for which the classifier predicted the correct label.
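
As a small worked example of the accuracy measure, the sketch below compares made-up true and predicted labels by hand. The lists and variable names are hypothetical and only serve to show the arithmetic.

```python
# Made-up true and predicted labels (illustration only).
true_labels = [1, 2, 2, 3, 1, 3, 2, 1]
predicted_labels = [1, 2, 3, 3, 1, 1, 2, 1]

# Count the positions where prediction and truth agree, then normalise.
n_correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
toy_accuracy = n_correct / len(true_labels)

print(n_correct, toy_accuracy)  # 6 0.75
```

Six of the eight predictions match, giving an accuracy of 0.75. scikit-learn offers the same measure through `sklearn.metrics.accuracy_score` and through the `score` method of its classifiers.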

In [None]:
from sklearn import tree
from sklearn.naive_bayes import MultinomialNB

# Decision Tree Classifier

model_tree = tree.DecisionTreeClassifier(random_state=1) # Setting random_state = 1, to remove randomness in fitting.
model_tree = model_tree.fit(X_train, y_train)
y_predict = model_tree.predict(X_test)
accuracy_tree = <FILL_IN>
print('Decision Tree accuracy:'+'\t'+repr(accuracy_tree))

# Naive Bayes Classifier:

model_bayes = MultinomialNB()
model_bayes = model_bayes.fit(X_train, y_train)
y_predict = model_bayes.predict(X_test)
accuracy_bayes = <FILL_IN>
print('Naive Bayes accuracy:'+'\t'+repr(accuracy_bayes))

Now that we have a model, we will try to improve it. The first thing we can try is to **redefine our corpus**.
There are two problems with our current corpus:

   a. Many tweets contain **URLs**, which are clearly not informative for the classification.
   
   b. We have not removed the **stopwords** from the corpus (= very common words without much meaning).
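
To see what these two problems look like from the vectorizer's point of view, the sketch below tokenizes a made-up tweet by hand. The tweet and its link are invented, and `TOKEN_PATTERN` is written out here on the assumption that it matches the default `token_pattern` of `CountVectorizer` (runs of two or more word characters).

```python
import re

# Assumption: this pattern mirrors CountVectorizer's default token_pattern,
# which keeps runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Made-up tweet with a made-up shortened link (illustration only).
toy_tweet = "De haven en de stad https://t.co/Ab12Cd34"

tokens = re.findall(TOKEN_PATTERN, toy_tweet.lower())
print(tokens)
# ['de', 'haven', 'en', 'de', 'stad', 'https', 'co', 'ab12cd34']
```

The link contributes the tokens 'https' and 'co', which would appear for every author, plus a one-off identifier; the function words 'de' and 'en' are equally uninformative.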

**TASK**: Make a new corpus in which the URLs and the Dutch stopwords are removed. Test the difference in performance of the classifiers. Explain why the performance, at first sight, might not have improved.



In [None]:
# Download the Dutch stop words from the NLTK repository.
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords

## Removing the links from the tweets (starting with https:// until a space)
# HINT: iterate over the corpus, and use re.sub function to replace 'https' pieces with an empty string ''. 
corpus_no_url = <FILL_IN>
    
## Removing stopwords from Dutch language
# HINT: Split the words in a tweet, only keep words that are not in stopWords, then join the separate words into a string.
stopWords = set(stopwords.words('dutch'))
corpus_no_stops_no_url = <FILL_IN>

# print(corpus_no_stops_no_url[0:5])

In [None]:
# Performance testing with the new corpus
<FILL_IN>

print('Decision Tree accuracy with new corpus:'+'\t'+<FILL_IN>)

print('Naive Bayes accuracy with new corpus:'+'\t'+<FILL_IN>)

Give a couple of reasons why the accuracy might not be higher than with the original corpus:
<FILL_IN>