# Predicting Trump tweet locations

This is a [Google Colab](https://colab.research.google.com/) notebook, and below is your first line of [Python](https://www.python.org/) code.

Let's run it with `Shift-Enter`, which is pretty much the only keyboard shortcut that you will need for this session:

In [None]:
print("Hello world!")

Hello world!


Well done, and welcome to Python (Python 3, specifically, but let's not get started with Python versions).

Everything below this line is meant to give you an overview of how the Python language works. This session does not really introduce anything new in terms of quantitative methods -- we will be using the same kind of machine learning approach as we did last week with `tidymodels`, which we could also have used to write the following analysis.

### Credits

Today's example analysis [comes from Bernhard Rieder](https://github.com/bernorieder/hybridclassification), at the University of Amsterdam, with a few additions inspired by [Laura Nelson's text analysis tutorials](https://github.com/lknelson/text-analysis-2017). The code loosely follows an [introduction to text analysis](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) provided by the [Scikit-Learn](https://scikit-learn.org/stable/) library.

We are using rather old code on purpose, in order to stick with time-tested functions that have been around in Python for quite a while. Python is an extremely fast-moving language, with thousands of new packages and updates being published every year.

# Loading required packages

Let's start by loading the packages used in our example analysis, using slightly different loading mechanisms that allow us to either

1. Load an entire package under its original name (`os`, `requests`), or
2. Load an entire package under a different, shorter name, as we do below with Pandas (`pd`) and NumPy (`np`), or
3. Load specific functions only, as we do below for a selection of Scikit-Learn (`sklearn`) functions.

In [None]:
# os -- Python module to handle files and folders
import os

# requests -- Python module to handle HTTP calls
import requests

# Pandas -- data manipulation
import pandas as pd

# NumPy -- mathematical functions
import numpy as np

# Scikit-learn -- machine learning
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

## Installing packages locally

Thanks to Google Colab, we are skipping package installation entirely, which is an absolute nightmare in Python due to conflicts between [Python versions, kernels, environments](https://xkcd.com/1987/) and so on.

If you are, however, interested in running this notebook locally, you will need to install Pandas, Requests, NumPy and Scikit-Learn first. The standard way to do so in Python 3 is to use `pip3`, as in

```sh
pip3 install pandas
```

Check your default Python version first, as well as the path to your Python installation, which is probably included in your `PATH` if your machine uses a `.bash_profile` config or something like it:

```sh
python --version
which python
```

You can run these commands from a notebook like this one by prefixing each of them with `!`, and then running the cell as you would run Python code.
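The same information is also available from Python itself, which removes any doubt about which interpreter the notebook is actually running. A minimal sketch, using only the standard library (the variable names are just for illustration):

```python
import sys

# version of the interpreter that is running this code
python_version = sys.version_info
print("Python %d.%d.%d" % (python_version.major, python_version.minor, python_version.micro))

# full path to that interpreter
python_path = sys.executable
print(python_path)
```

If the path printed here differs from what `which python` returns, your shell and your notebook are not using the same Python installation.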

Another way to possibly save yourself some trouble with package installation is to install something like [Miniconda](https://docs.anaconda.com/free/miniconda/), which comes with batteries included.

Last, note that there is a troubleshooting section at the end of this notebook, for idiots like me who tried all routes of action above and ended up messing with their Python installs.

Alright, let's roll now.

# Loading your example data

Like R, Python looks for files in the working directory:

In [None]:
# get current working directory
os.getcwd()

'/content'

Right now, Google Colab gives you access to a few demo files located in the `sample_data` folder:

In [None]:
# list visible files/folders
os.listdir()

['.config', 'tcat_trump_full.csv', 'sample_data']

For this notebook to work, you will need the [`tcat_trump_full.csv`](https://github.com/bernorieder/hybridclassification/blob/master/tcat_trump_full.csv) dataset, which contains a collection of tweets posted by Donald J. Trump between 2016 and 2018. The dataset also comes from Bernhard Rieder, who likely obtained it through [DMI-TCAT](https://github.com/digitalmethodsinitiative/dmi-tcat), the Digital Methods Initiative Twitter Capture and Analysis Toolset, which he helped to code.

Here are two ways to upload our example data to the (virtual) `content` folder:

1. Execute the next code chunk, or
2. Use the left-hand-side folder icon to view the directory, and upload the data to its root.

In [None]:
# target file
filename_train = "tcat_trump_full.csv"

# download it if it does not exist
if filename_train not in os.listdir():
  r = requests.get("https://raw.githubusercontent.com/bernorieder/hybridclassification/master/tcat_trump_full.csv")
  if r.status_code != 200:
    raise Exception("Failed to get file.")
  with open(os.path.join(os.getcwd(), filename_train), "wb") as f:
    f.write(r.content)

# read training data file
df = pd.read_csv(filename_train, delimiter = ",", encoding = "utf-8")
print("Loaded `" + filename_train + "` (" + str(len(df.index)) + " rows).")

Loaded `tcat_trump_full.csv` (8818 rows).


## On collecting social media data

The tool used to collect the example data used below, [DMI-TCAT](https://www.digitalmethods.net/Dmi/ToolDmiTcat), relied on the first (and now disabled) version of the Twitter API. If you are interested in collecting Twitter data, check whatever is left of [version 2](https://developer.twitter.com/en/docs/twitter-api) of its API, now that Twitter has been ~~destroyed~~ rebranded to X.

DMI-TCAT has been superseded by [4CAT](https://4cat.nl/), which covers more social media sources, but which will also soon get crippled when [CrowdTangle](https://www.crowdtangle.com/), which could be used to collect Facebook data, [stops in August 2024](https://help.crowdtangle.com/en/articles/9014544-important-update-to-crowdtangle-march-2024). Facebook/Meta currently releases data for academics via a different program, the [Meta Content Library](https://transparency.meta.com/en-gb/researchtools/meta-content-library/).

If you are interested in collecting and analyzing social media data, check out the rest of the resources maintained by the [Digital Methods Initiative](https://wiki.digitalmethods.net/Dmi/ToolDatabase), as well as those maintained by the Sciences Po [médialab](https://medialab.sciencespo.fr/en/tools/), which includes some Web crawlers and scrapers.

## Exploring the dataset

The code above will have created a Pandas data frame and stored it in the `df` object. Given your R background, a Pandas data frame is exactly what you think it is, although the column (variable) types are different, because Python:

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8818 entries, 0 to 8817
Data columns (total 36 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           8818 non-null   int64  
 1   time                         8818 non-null   int64  
 2   created_at                   8818 non-null   object 
 3   from_user_name               8818 non-null   object 
 4   text                         8818 non-null   object 
 5   filter_level                 8818 non-null   object 
 6   possibly_sensitive           2734 non-null   float64
 7   withheld_copyright           1 non-null      float64
 8   withheld_scope               1 non-null      object 
 9   truncated                    8818 non-null   int64  
 10  retweet_count                8818 non-null   int64  
 11  favorite_count               8818 non-null   int64  
 12  lang                         8818 non-null   object 
 13  to_user_name      

Exploring and manipulating datasets with Pandas is [pretty straightforward](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html), and will remind you of how it's done in R. Many Pandas functions closely resemble those of the `dplyr` package that you know from using the `tidyverse`.

In the results below, note an interesting characteristic of the Python language -- it starts counting at zero, which is why the first row below is numbered as such. Also notice how Python handles missing values (`NaN` in several columns).

In [None]:
# top 10 rows
df.head(10)

Unnamed: 0,id,time,created_at,from_user_name,text,filter_level,possibly_sensitive,withheld_copyright,withheld_scope,truncated,...,from_user_utcoffset,from_user_timezone,from_user_lang,from_user_tweetcount,from_user_followercount,from_user_friendcount,from_user_favourites_count,from_user_listed,from_user_withheld_scope,from_user_created_at
0,696329245866512384,1454852804,2016-02-07 13:46:44,realDonaldTrump,I will be on Meet the Press with Chuck Todd on...,none,0.0,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
1,696346322735988736,1454856875,2016-02-07 14:54:35,realDonaldTrump,.@ABCPolitics #GOPDebate #MakeAmericaGreatAgai...,none,0.0,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
2,696370121154150400,1454862549,2016-02-07 16:29:09,realDonaldTrump,Great to meet everyone while having breakfast ...,none,0.0,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
3,696424442717601792,1454875500,2016-02-07 20:05:00,realDonaldTrump,We are going to have a big event at the Verizo...,none,0.0,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
4,696428451977433088,1454876456,2016-02-07 20:20:56,realDonaldTrump,Thank you Newt! https://t.co/6FkwdpI0Oj,none,0.0,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
5,696447301783592961,1454880950,2016-02-07 21:35:50,realDonaldTrump,"Thank you- Plymouth, New Hampshire! #FITN #NHP...",none,0.0,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
6,696463477104365568,1454884807,2016-02-07 22:40:07,realDonaldTrump,I am in New Hampshire having a great time! Lov...,none,,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
7,696514666843996160,1454897011,2016-02-08 02:03:31,realDonaldTrump,So far the Super Bowl is very boring - not nea...,none,,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
8,696665030549487616,1454932861,2016-02-08 12:01:01,realDonaldTrump,"My two wonderful sons, Don and Eric, will be o...",none,,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
9,696669855416766467,1454934011,2016-02-08 12:20:11,realDonaldTrump,Jeb Bush has zero communication skills so he s...,none,,,,0,...,-14400.0,Eastern Time (US & Canada),en,33872,12843276,41,44,40746,,2009-03-18 13:46:38
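Both points (zero-based counting and `NaN` for missing values) can be checked on a small example. The sketch below uses a made-up data frame, not the tweets dataset; only the `possibly_sensitive` column name is borrowed from it.

```python
import numpy as np
import pandas as pd

# toy data frame (hypothetical values), with one missing value
toy = pd.DataFrame({
    "text": ["first tweet", "second tweet", "third tweet"],
    "possibly_sensitive": [0.0, np.nan, 0.0],
})

# the first row sits at position 0, not 1
first_row = toy.iloc[0]
print(first_row["text"])

# missing values show up as NaN, and are detected with isna()
n_missing = toy["possibly_sensitive"].isna().sum()
print(n_missing)
```

Asking for `toy.iloc[3]` would raise an `IndexError`, since the three rows are numbered 0, 1 and 2.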


## Preparing the dataset

Selecting columns, filtering rows (subsetting) and so on is also straightforward with Pandas. Since our example analysis will focus on predicting the location from which Trump sent his tweets, let's restrict the data to just the few columns that we will be using:

In [None]:
# select columns of interest
trump = df[["id", "created_at", "text", "location"]]
# inspect result
trump

Unnamed: 0,id,created_at,text,location
0,696329245866512384,2016-02-07 13:46:44,I will be on Meet the Press with Chuck Todd on...,"New York, NY"
1,696346322735988736,2016-02-07 14:54:35,.@ABCPolitics #GOPDebate #MakeAmericaGreatAgai...,"New York, NY"
2,696370121154150400,2016-02-07 16:29:09,Great to meet everyone while having breakfast ...,"New York, NY"
3,696424442717601792,2016-02-07 20:05:00,We are going to have a big event at the Verizo...,"New York, NY"
4,696428451977433088,2016-02-07 20:20:56,Thank you Newt! https://t.co/6FkwdpI0Oj,"New York, NY"
...,...,...,...,...
8813,1073965319184678913,2018-12-15 15:37:40,Never in the history of our Country has the “p...,"Washington, DC"
8814,1073974873939169282,2018-12-15 16:15:38,"The pathetic and dishonest Weekly Standard, ru...","Washington, DC"
8815,1073982314588323845,2018-12-15 16:45:12,"Wow, 19,000 Texts between Lisa Page and her lo...","Washington, DC"
8816,1074302851906707457,2018-12-16 13:58:54,"A REAL scandal is the one sided coverage, hour...","Washington, DC"
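The cell above only selects columns. Filtering rows works along similar lines, and might look like the sketch below, which uses a made-up data frame with the same column names as `trump` (the rows themselves are hypothetical).

```python
import pandas as pd

# toy data frame (hypothetical rows) with the same column names as `trump`
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "text": ["Thank you Newt!", "Fake news again", "Thank you Ohio!"],
    "location": ["New York, NY", "Washington, DC", "New York, NY"],
})

# keep only the rows sent from New York (boolean mask)
ny = toy[toy["location"] == "New York, NY"]

# same result with query(), closer in spirit to dplyr's filter()
ny_query = toy.query("location == 'New York, NY'")

# keep only the rows whose text contains a given string
thanks = toy[toy["text"].str.contains("Thank you")]

print(len(ny), len(ny_query), len(thanks))
```

The boolean mask is the most common idiom; `query()` is mostly a matter of taste.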


The dataset contains a fair split of the two locations from which Trump tweeted (New York and Washington):

In [None]:
# list and count distinct values
trump["location"].value_counts()

location
Washington, DC    5488
New York, NY      3330
Name: count, dtype: int64

Last, let's quickly note that missing values will not be an issue in what follows:

In [None]:
# count missing values per column
trump.isnull().sum()

id            0
created_at    0
text          0
location      0
dtype: int64

# Preprocessing the features

The label (i.e. the outcome variable) that we are interested in predicting is the `location` tag of the tweets.

Let's extract the text of the tweets, our sole predictor, and store it as `X`, and let's also extract our location feature and store it as `Y`, while recoding it to (0, 1), with New York coded as `0` and Washington coded as `1`.

In [None]:
# data selection
X = trump["text"].astype('U')			# text column
Y = trump["location"].astype(str)	# label column

# transform categories into numbers
le = LabelEncoder()
Y = le.fit_transform(Y)

# locations are now coded as (NY = 0, DC = 1)
np.unique(Y, return_counts = True)

(array([0, 1]), array([3330, 5488]))
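Why New York ends up as `0` is worth spelling out: `LabelEncoder` sorts the categories before numbering them, so the codes follow alphabetical order. A minimal sketch on three made-up labels (`le_toy` is a hypothetical name, chosen so as not to overwrite `le`):

```python
from sklearn.preprocessing import LabelEncoder

# toy labels (same two categories as in the dataset)
labels = ["Washington, DC", "New York, NY", "Washington, DC"]

le_toy = LabelEncoder()
codes = le_toy.fit_transform(labels)

# classes are sorted alphabetically, so "New York, NY" gets 0
print(list(le_toy.classes_))
print(list(codes))

# go back from numeric codes to the original labels
decoded = list(le_toy.inverse_transform([0, 1]))
print(decoded)
```

Keeping the encoder object around matters: it is what lets us translate predictions back into readable locations later on.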

Let's now split the data into two groups, a training set (containing 80% of the data), which we will use to train our model, and a test set, containing the remaining 20% of the data (tweets):

In [None]:
# cutting the dataset into training (80%) and testing (20%) data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

# take a quick look at the training data
X_train

3842    Getting ready to visit Walter Reed Medical Cen...
4254    Thank you!   “Trump’s Defining Speech”  WSJ Ed...
2033    Thank you Green Bay, Wisconsin! Governor @Mike...
3233    Someone incorrectly stated that the phrase "DR...
1244    "@LouDobbs: Hillary Just Handed @realDonaldTru...
                              ...                        
5824    Republicans want to fix DACA far more than the...
8347    Our military is being mobilized at the Souther...
6523    Senator Schumer and Obama Administration let p...
5162    “WHAT HAPPENED”  “How Team Hillary played the ...
3184    I spent Friday campaigning with John Kennedy, ...
Name: text, Length: 7054, dtype: object
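Note that the split above is random, so the rows shown (and every result that follows) will change slightly from one run to the next. If you need reproducible results, one option is to fix the random seed, and possibly to stratify the split so that both sets keep the same share of each location. The sketch below illustrates both arguments on made-up data; it is not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: 10 documents, 6 labelled 1 and 4 labelled 0 (hypothetical)
X_toy = np.array(["tweet %d" % i for i in range(10)])
Y_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

# random_state fixes the shuffle; stratify keeps class proportions
split_a = train_test_split(X_toy, Y_toy, test_size = 0.2, random_state = 42, stratify = Y_toy)
split_b = train_test_split(X_toy, Y_toy, test_size = 0.2, random_state = 42, stratify = Y_toy)

X_tr, X_te, Y_tr, Y_te = split_a
print(len(X_tr), len(X_te))

# both calls return exactly the same test documents
print(list(split_a[1]) == list(split_b[1]))
```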

The next step consists in extracting the contents of each tweet, and in either simply counting the number of occurrences of each word, or using a weighted count, known in text mining as [tf-idf](https://en.wikipedia.org/wiki/tf%E2%80%93idf) (term frequency–inverse document frequency). We will go for the simplest option, as does Bernhard Rieder in his code:

In [None]:
use_tfidf = False
frequency_cutoff = 3

# see: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
if not use_tfidf:
  # 1- and 2-gram vectorizer
  count_vect = CountVectorizer(ngram_range = (1, 2), min_df = frequency_cutoff, stop_words = "english")
else:
  # 1- and 2-gram vectorizer with tf-idf transformation (depending on the data, this may work better or not)
  count_vect = TfidfVectorizer(ngram_range = (1, 2), min_df = frequency_cutoff, stop_words = "english")

# vectorize and weigh training data
X_train_counts = count_vect.fit_transform(X_train)

X_train_counts

<7054x7092 sparse matrix of type '<class 'numpy.int64'>'
	with 97025 stored elements in Compressed Sparse Row format>
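To get a feel for what the tf-idf option would change, here is a minimal sketch that compares raw counts and tf-idf weights on a made-up three-document corpus, using the default settings of both vectorizers (no n-grams, no cutoff, no stop words):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy corpus (hypothetical): "fake" occurs in every document
docs = ["fake news", "fake media", "fake collusion"]

# raw counts: every word that occurs once gets a 1
counts = CountVectorizer().fit_transform(docs).toarray()

# tf-idf: words that occur in many documents are down-weighted
tfidf_vect = TfidfVectorizer()
weights = tfidf_vect.fit_transform(docs).toarray()
vocab = tfidf_vect.vocabulary_

# compare both words of the first document ("fake news")
print(counts[0])
print(weights[0, vocab["fake"]], weights[0, vocab["news"]])
```

In the first document, both words have a count of 1, but `"news"` receives a higher tf-idf weight than `"fake"` because it is specific to that document.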

The resulting `X_train_counts` object is a [sparse matrix](https://en.wikipedia.org/wiki/Sparse_matrix), i.e. a matrix that contains mostly zero values, with non-zero counts only where a keyword occurs in a given tweet. Note that we are using more than single words (e.g. `"people"`) as keywords here: we are using [1-grams and 2-grams](https://en.wikipedia.org/wiki/N-gram), which means that combinations of 2 keywords (e.g. `"certain people"`) are also present in the matrix.

In [None]:
# convert n-gram counts to array
count_array = X_train_counts.toarray()
# convert array to a dataframe with column names
count_df = pd.DataFrame(data = count_array, columns = count_vect.get_feature_names_out())
# show a little extract
count_df.iloc[1050:1060,1000:1009]

Unnamed: 0,ceo,ceremony,certain,certain people,certainly,certified,cfpb,chain,chain migration
1050,0,0,0,0,0,0,0,0,0
1051,0,0,0,0,0,0,0,0,0
1052,0,0,0,0,0,0,0,0,0
1053,0,0,0,0,0,0,0,0,0
1054,0,0,0,0,0,0,0,0,0
1055,0,0,0,0,0,0,0,0,0
1056,0,0,0,0,0,0,0,0,0
1057,0,0,0,0,0,0,0,0,0
1058,0,0,0,0,0,0,0,0,0
1059,0,0,0,0,0,0,0,0,0


# Fitting a classifier model

Bernhard Rieder's code gives us two model options at this stage, a [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) classifier and a [Support Vector Machine](https://en.wikipedia.org/wiki/Support_vector_machine) (SVM) classifier. Both should work, but SVM should produce slightly better results.

In [None]:
type_classifier = "svm"

# train the classifier
if type_classifier == "bayes":
  clf = MultinomialNB()
else:
  # linear SVM (hinge loss) trained by stochastic gradient descent
  # note: lots of hyperparameters here
  clf = SGDClassifier(loss = "hinge", penalty = "l2", alpha = 1e-3, random_state = 42, max_iter = 50, tol = 1e-3)

# train the model
clf.fit(X_train_counts,Y_train)

Now that the model is trained, let's repeat the preprocessing step that we took earlier, this time with the test data, apply the model to it, and check what it predicted for a short selection of tweets:

In [None]:
# vectorize and weigh test data
X_test_counts = count_vect.transform(X_test)

# apply model to the test data
predicted = clf.predict(X_test_counts)

# print the first six predictions to get an idea
counter = 0
for doc, category in zip(X_test, predicted):
  print('\n%r \n=> %s' % (doc[0:100] + "...", le.classes_[category]))
  counter += 1
  if counter > 5:
    break


'I am on @FoxNews with @greta doing a town hall, from Wisconsin- now! Enjoy! #MakeAmericaGreatAgain #...' 
=> New York, NY

'Could somebody please explain to the Democrats (we need their votes) that our Country losses 250 Bil...' 
=> Washington, DC

'Today, it was my great honor to welcome President Moon Jae-in of the Republic of Korea to the @White...' 
=> Washington, DC

'RT @EricTrump: #Truth @Acosta https://t.co/aCfFoeqL1f...' 
=> Washington, DC

'I hope everyone had a great Memorial Day!...' 
=> Washington, DC

'The "deplorables" came back to haunt Hillary.They expressed their feelings loud and clear. She spent...' 
=> New York, NY
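Notice that the test data went through `transform()`, not `fit_transform()`: the vocabulary is learned from the training data only, and then reused as is. A minimal sketch of the difference, on made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy training and test documents (hypothetical)
train_docs = ["fake news", "great honor"]
test_docs = ["fake collusion"]  # "collusion" never occurs in training

vect = CountVectorizer()
train_counts = vect.fit_transform(train_docs)  # learns the vocabulary
test_counts = vect.transform(test_docs)        # reuses it, learns nothing

# same columns in both matrices; the unseen word is silently dropped
print(train_counts.shape, test_counts.shape)
print(test_counts.toarray())
```

Calling `fit_transform()` on the test data instead would build a different vocabulary, and the classifier would no longer receive the columns that it was trained on.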


## Assessing model performance

The overall accuracy of the model is only reasonable in this case, but it might still yield some interesting insights if it identified some n-grams that were particularly predictive of the location from which Trump tweeted (i.e. feature importance).

Let's quickly look at both metrics.

In [None]:
# calculate and print accuracy score
print("Accuracy score: %r" % round(accuracy_score(Y_test, predicted), 3))

Accuracy score: 0.872
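To put that score in context, it helps to compare it to what a trivial model would achieve by always predicting the most frequent location. The sketch below computes that baseline from the class counts shown earlier in this notebook; since the test set is a random 20% sample, its own class shares may differ slightly.

```python
# class counts taken from the value_counts() output earlier in this notebook
n_dc = 5488
n_ny = 3330

# accuracy obtained by always predicting the most frequent location
baseline = max(n_dc, n_ny) / (n_dc + n_ny)
print("Baseline accuracy: %r" % round(baseline, 3))
```

Any accuracy score should be read against this kind of baseline rather than against zero.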


In [None]:
# show most informative features (may fail if there are sparse labels)
feature_names = count_vect.get_feature_names_out()

no_features = 10

print("\nMost informative features (high to low):\n")
# two paths needed since binary and multiclass result structures are quite different
if len(le.classes_) == 2:
	out = "\t"
	for label in le.classes_:
		out += '%-40s' % label
	print(out)
	coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
	top = zip(coefs_with_fns[:no_features], coefs_with_fns[:-(no_features + 1):-1])
	for (coef_1, fn_1), (coef_2, fn_2) in top:
		print('\t%.4f\t%-17s\t\t%.4f\t%-15s' % (coef_1, fn_1, coef_2, fn_2))
else:
	longest = len(max(le.classes_, key=len))
	for i, class_label in enumerate(le.classes_):
		top = np.argsort(clf.coef_[i])[-no_features:]
		print('{0: <{1}}'.format(class_label, longest)," ".join(feature_names[j] for j in top[::-1]))


Most informative features (high to low):

	New York, NY                            Washington, DC                          
	-1.0316	trump2016        		1.0249	love https     
	-0.9574	teamtrump        		0.9642	rt             
	-0.9507	makeamericagreatagain		0.8900	democrats      
	-0.9507	realdonaldtrump  		0.8159	congratulations
	-0.9372	crookedhillary   		0.7889	president trump
	-0.9170	mike_pence       		0.7889	fake           
	-0.8226	hillary          		0.7484	comey          
	-0.8091	cruz             		0.7484	fake news      
	-0.8024	rt teamtrump     		0.6945	whitehouse     
	-0.7889	clinton          		0.6540	collusion      
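The binary branch of the code above relies on a sorting and slicing idiom that is a little cryptic at first sight. Here it is again in isolation, with made-up coefficients (the numbers below are hypothetical and only borrow a few feature names from the output):

```python
# hypothetical coefficients and feature names, for illustration only
coefs = [-1.0, 0.5, 0.9, -0.3]
names = ["trump2016", "comey", "rt", "hillary"]
n = 2

# sort (coefficient, feature) pairs from most negative to most positive
pairs = sorted(zip(coefs, names))

# most negative coefficients: push towards the class coded 0
lowest = pairs[:n]
# most positive coefficients, largest first: push towards the class coded 1
highest = pairs[:-(n + 1):-1]

print(lowest)
print(highest)
```

This is why the left-hand column of the output, under New York (coded `0`), only contains negative coefficients, while the right-hand column, under Washington (coded `1`), only contains positive ones.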


… And there you go: you just ran your first machine learning classifier in Python.

This kind of classifier can be used in many scenarios, such as spam detection. As an exercise, try running the code again using a naive Bayes classifier, or more ambitiously, try replacing the classifier with a [logistic regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), or with a [random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
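As a starting point, here is a minimal sketch of what swapping in a logistic regression might look like. It runs on a made-up four-document corpus rather than on the tweets, and the `max_iter` value is an arbitrary choice, not a recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# toy corpus and labels (hypothetical; 0 = New York, 1 = Washington)
docs = ["trump2016 rally tonight", "teamtrump rally", "fake news democrats", "democrats collusion"]
labels = [0, 0, 1, 1]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# same fit/predict interface as MultinomialNB or SGDClassifier
logit = LogisticRegression(max_iter = 1000)
logit.fit(counts, labels)

# predict the label of a new (made-up) document
new_counts = vect.transform(["rally with teamtrump"])
predicted_toy = logit.predict(new_counts)
print(predicted_toy)
```

Since all Scikit-Learn classifiers share the same `fit()` and `predict()` methods, adapting the notebook mostly comes down to importing the new model and changing the line that creates `clf`.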

# Additional resources

- [Pandas](https://pandas.pydata.org/) comes with excellent documentation that also covers a fair amount of Python basics, such as [string manipulation](https://pandas.pydata.org/docs/user_guide/text.html).
- [Scikit-Learn](https://scikit-learn.org/) also comes with excellent documentation, especially if you are interested in [working with text data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) as we did here.
- [Going with Python](https://github.com/briatte/dsr/wiki/Going-with-Python) contains a selection of Python books, tutorials and courses, in English and French.

# Troubleshooting

This notebook was first coded using a local Python 3 version installed via [Homebrew](https://brew.sh/), and then tested through the version of [Jupyter](https://jupyter.org/) that can be installed and launched via [Anaconda Navigator](https://docs.anaconda.com/free/navigator/index.html).

## Editing Jupyter kernels

At some point, this setup broke because Python 3 got upgraded, and Jupyter could not find it by following its kernel specs. Jupyter kernels can be listed with the following command:

```sh
jupyter kernelspec list
```

Each path leads to a `kernel.json` file that can be manually edited to point to an available Python installation.

## Force installing packages

Further note that when Python was installed with Homebrew, `pip3` treats it as an externally managed environment and refuses to install packages in it. This can be bypassed, as in

```sh
pip3 install scikit-learn --break-system-packages --user
```

The message that `pip3` will show if you do not include these options will recommend them, while warning that they might break your ~~nerves~~ Python installation.