# Exercise 1: Working with textual data

### 0. Get the data.

- Download the dataset from https://surfdrive.surf.nl/files/index.php/s/bfNFkuUVoVtiyuk. This is a subset of the data from https://doi.org/10.7910/DVN/YHWTFC. 

- Unpack it. On Linux and macOS, you can do this with `tar -xzf mydata.tar.gz` on the command line. On Windows, you may need an additional tool such as `7zip` (technically, the download is a `tar` archive wrapped in `gz` compression, so unpacking may take *two* steps depending on your tool).
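
If you prefer to stay within Python, the standard library's `tarfile` module can unpack the archive on any operating system. The following is a minimal sketch; the helper name `unpack_archive` and the default target folder are assumptions for illustration, and `mydata.tar.gz` stands for whatever your downloaded file is called.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, target_dir="."):
    """Extract a gzip-compressed tar archive and return the names now present in target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # mode "r:gz" handles the gzip layer and the tar layer in one step
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir())
```

Calling `unpack_archive("mydata.tar.gz")` would then extract the contents into the current folder. Only unpack archives from sources you trust.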


### 1. Inspect the structure of the dataset.
What information do the following elements give you?

- folder (directory) names
- folder structure/hierarchy
- file names
- file contents
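
One way to get a first overview without opening thousands of files by hand is to count the files per folder. This is a rough sketch; the function name `count_files_per_folder` and the `depth` parameter are hypothetical helpers, not part of the exercise material.

```python
from collections import Counter
from pathlib import Path


def count_files_per_folder(root, depth=2):
    """Count files under root, grouped by the first `depth` folder levels below root."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            # keep folder names only, drop the file name itself
            parts = path.relative_to(root).parts[:-1]
            counts["/".join(parts[:depth])] += 1
    return counts
```

For example, `sorted(count_files_per_folder("articles").items())` lists each folder combination next to its number of files, which already says something about what the folder names and hierarchy encode.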



### 2. Discuss strategies for working with this dataset!

- Which questions could you answer?
- How could you deal with it, given the size and the structure?
- How much memory<sup>1</sup> (RAM) does your computer have? How large is the complete dataset? What does that mean?
- Make a sketch (e.g., with pen and paper) of how you could organize your workflow and your data to answer your question.

<sup>1</sup> *memory* (RAM), not *storage* (hard disk)!
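
To compare the dataset with your RAM, you first need its size on disk. The sketch below adds up file sizes without reading any file contents; both function names are assumptions for illustration.

```python
import os


def dataset_size_bytes(root):
    """Sum the sizes of all files below root without reading their contents."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total


def human_readable(n_bytes):
    """Format a byte count using binary units (1 KiB = 1024 bytes)."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TiB"
```

`human_readable(dataset_size_bytes("articles"))` gives a number you can hold against your RAM. Keep in mind that text loaded into Python strings can take up more memory than the same text on disk.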

### 3. Read some (or all?) data

Here is some example code that you can modify. Assuming that the folder `articles` is in the same folder as the notebook you are currently working on, you could, for instance, do the following to read a *part* of your dataset.

```python
from glob import glob
infowarsfiles = glob('articles/*/Infowars/*')
infowarsarticles = []
for filename in infowarsfiles:
    with open(filename) as f:
        infowarsarticles.append(f.read())
```

- Can you explain what the `glob` function does?
- What does `infowarsfiles` contain, and what does `infowarsarticles` contain? First make an educated guess based on the code snippet, then check it! Do *not* print the whole thing, but use `len`, `type` and slicing `[:10]` to get the info you need.

- Tip: take a random sample of the articles for practice purposes (if your code works, you can scale up!)

```python
import random

# take a random sample of the articles for practice purposes
articles = random.sample(infowarsarticles, 10)
```
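
The inspection advice above (`len`, `type`, slicing) and the sampling tip can be wrapped in two small helpers. This is only a sketch: `peek` and `reproducible_sample` are hypothetical names, and the fixed seed is an assumption that makes the sample identical on every run, which helps when debugging.

```python
import random


def peek(items, n=10, width=60):
    """Return a small summary of a list: its type, its length, and shortened first items."""
    preview = [repr(item)[:width] for item in items[:n]]
    return {"type": type(items).__name__, "length": len(items), "preview": preview}


def reproducible_sample(items, k, seed=42):
    """Draw up to k items without replacement; the same seed gives the same sample."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))
```

For instance, `peek(infowarsarticles, n=3)` shows the first 60 characters of three articles instead of flooding the notebook with full texts.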



In [None]:
%pip install nltk

In [10]:
from glob import glob
import os  
import random
import nltk 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

data_dir = r"C:/Data Management/Gesis IML/articles-small" #adjust this to your data directory

In [3]:
infowarsfiles = glob(os.path.join(data_dir, 'articles/*/Infowars/*'))
infowarsarticles = []
for filename in infowarsfiles:
    with open(filename) as f:
        infowarsarticles.append(f.read())

In [4]:
# taking a random sample of the articles for practice purposes
articles = random.sample(infowarsarticles, 10)

### 4. Vectorize the data

Imagine you want to train a classifier that will predict whether articles come from a fake news source (e.g., `Infowars`) or a quality news outlet (e.g., `bbc`). In other words, you want to predict `source` based on linguistic variations in the articles.

To arrive at a model that will do just that, you have to transform 'text' to 'features'.

- Can you vectorize the data? Try defining different vectorizers. Consider the following options:
    - `count` vs. `tfidf` vectorizers
    - with/without pruning
    - with/without stopword removal
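
To build an intuition for what these options do, here is a simplified pure-Python sketch of counting, pruning, and tf-idf weighting on three made-up sentences. It is not scikit-learn's implementation: the toy documents and all function names are assumptions, `min_df` is treated as an absolute count and `max_df` as a proportion only, and the length normalisation that `TfidfVectorizer` applies by default is left out.

```python
import math
import re
from collections import Counter

# toy documents, invented for illustration
docs = [
    "the president said the vote was rigged",
    "the president said the economy is strong",
    "the match ended in a draw",
]


def tokenize(text):
    # lowercase and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(docs, stop_words=(), min_df=1, max_df=1.0):
    """Keep terms that appear in at least min_df and at most max_df * n_docs documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(docs)
    return sorted(
        term for term, df in doc_freq.items()
        if term not in stop_words and min_df <= df <= max_df * n_docs
    )


def count_vector(doc, vocabulary):
    """Raw term counts, one entry per vocabulary term."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


def tfidf_vector(doc, docs, vocabulary):
    """Weight counts by a smoothed inverse document frequency (no length normalisation)."""
    n_docs = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    weights = []
    for term, count in zip(vocabulary, count_vector(doc, vocabulary)):
        df = sum(term in tokens for tokens in token_sets)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        weights.append(count * idf)
    return weights
```

With `min_df=2, max_df=0.75`, the vocabulary shrinks to terms that occur in exactly two of the three documents: rare words and the ubiquitous `the` are both pruned. The tf-idf weighting keeps `the` but gives it a lower weight per occurrence than a rare word such as `vote`.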

In [6]:
infowarsarticles[:2]

['A high school in Vermonts capital raised a Black Lives Matter flag Thursday morning in honor of Black History Month.\n\nStudents at Montpelier High School, where 18 of 350 students are black, took turns raising the flag in a ceremony attended by hundreds of students, staff and community members.\n\nThe decision to fly the flag had drawn criticism from Republican state legislator Thomas Terenzini, who told WPTZ this week that the school was setting a bad example.',
 'A Monmouth Poll released on Wednesday that shows the heavily hyped Democratic generic Congressional ballot advantage has virtually disappeared is the latest poll indicating Democratic chances for major Congressional midterm gains are trending down.\n\nOn December 22, the Real Clear Politics Average of Polls gave Democrats a 13-point advantage in the generic Congressional ballot.\n\nBreitbart News reported last month  when the January 19 Real Clear Politics Average of Polls gave Democrats a 7.8-point advantage in the gener

In [28]:
cnt_vectorizer = CountVectorizer()
cnt_vectorizer.fit(infowarsarticles)
#find the most frequent words
term_counts = cnt_vectorizer.transform(infowarsarticles).sum(axis=0).A1
top50 = term_counts.argsort()[::-1][:50]
mystopwords = cnt_vectorizer.get_feature_names_out()[top50].tolist()
print(mystopwords)

['the', 'to', 'of', 'and', 'in', 'that', 'is', 'on', 'for', 'was', 'with', 'trump', 'as', 'he', 'it', 'by', 'are', 'at', 'this', 'have', 'said', 'be', 'his', 'not', 'an', 'has', 'they', 'who', 'from', 'president', 'you', 'we', 'about', 'their', 'were', 'people', 'but', 'or', 'its', 'out', 'will', 'after', 'which', 'one', 'would', 'been', 'all', 'she', 'if', 'her']


In [29]:
cnt_vectorizer_stop = CountVectorizer(stop_words=mystopwords)
cnt_vectorizer_stop75_2 = CountVectorizer(stop_words=mystopwords, max_df=.75, min_df=2)
cnt_vectorizer_stop75_2_tree = CountVectorizer(tokenizer=nltk.TreebankWordTokenizer().tokenize, stop_words=mystopwords, max_df=.75, min_df=2,
    token_pattern=None)


### 5. Fit a classifier

- Try out a simple supervised model. Find some inspiration [here](possible-solution-exercise-day1.md). Can you predict the `source` using linguistic variations in the articles?

- Which combination of pre-processing steps + vectorizer gives the best results?

In [13]:
def read_data(listofoutlets):
    texts = []
    labels = []
    for label in listofoutlets:
        for file in glob(os.path.join(data_dir, 'articles', '*', label, '*')):
            with open(file) as f:
                texts.append(f.read())
                labels.append(label)
    return texts, labels

X, y = read_data(['Infowars', 'The Guardian']) #choose your own newsoutlets

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [34]:
#compare performance of different vectorizers:

vectorizers = { "count": CountVectorizer(), 
               "count_stop": CountVectorizer(stop_words=mystopwords), 
               "count_pruned": CountVectorizer(stop_words=mystopwords, max_df=0.6, min_df=5), 
               "count_treebank": CountVectorizer(tokenizer=nltk.TreebankWordTokenizer().tokenize, token_pattern=None), 
               "tfidf": TfidfVectorizer(max_df=0.6, min_df=5), 
               "tfidf_bigrams": TfidfVectorizer(ngram_range=(1,2), max_df=0.6, min_df=5) }

for vectorizer_name, vectorizer in vectorizers.items():
    model = MultinomialNB()
    X_features_train = vectorizer.fit_transform(X_train)
    X_features_test = vectorizer.transform(X_test)
    model.fit(X_features_train, y_train)
    y_pred = model.predict(X_features_test)

    print(f"Accuracy {vectorizer_name}: {accuracy_score(y_test, y_pred)}")
    print(classification_report(y_test, y_pred))

Accuracy count: 0.835
              precision    recall  f1-score   support

    Infowars       0.94      0.72      0.82       414
The Guardian       0.76      0.95      0.85       386

    accuracy                           0.83       800
   macro avg       0.85      0.84      0.83       800
weighted avg       0.86      0.83      0.83       800

Accuracy count_stop: 0.8275
              precision    recall  f1-score   support

    Infowars       0.93      0.72      0.81       414
The Guardian       0.76      0.94      0.84       386

    accuracy                           0.83       800
   macro avg       0.84      0.83      0.83       800
weighted avg       0.85      0.83      0.83       800

Accuracy count_pruned: 0.8425
              precision    recall  f1-score   support

    Infowars       0.90      0.79      0.84       414
The Guardian       0.80      0.90      0.85       386

    accuracy                           0.84       800
   macro avg       0.85      0.84      0.84     

### BONUS: Increasing efficiency + reusability
The approach under (3) gets you very far.
But for those of you who want to go the extra mile, here are some suggestions for further improvements in handling such a large dataset, consisting of thousands of files, and for deeper thinking about data handling:

- Consider writing a function to read the data. Let your function take three parameters as input, `basepath` (where is the folder with articles located?), `month` and `outlet`, and return the articles that match this criterion.
- Even better, make it a *generator* that yields the articles instead of returning a whole list.
- Consider yielding a dict (with date, outlet, and the article itself) instead of yielding only the article text.
- Think of the most memory-efficient way to get an overview of how often a given regular expression R matches per outlet!
- Under which circumstances would you consider having your function for reading the data return a pandas dataframe?
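
As one possible starting point for these suggestions, the sketch below combines a generator that yields dicts with a per-outlet regex counter. It assumes the layout `basepath/articles/<month>/<outlet>/<file>` implied by the glob pattern in (3), and that the files are UTF-8 text; the function names and dict keys are illustrative choices, not a reference solution.

```python
import re
from collections import Counter
from pathlib import Path


def iter_articles(basepath, month="*", outlet="*"):
    """Yield one dict per article instead of building a list of all texts."""
    # assumed layout: basepath/articles/<month>/<outlet>/<file>
    pattern = f"articles/{month}/{outlet}/*"
    for path in sorted(Path(basepath).glob(pattern)):
        if path.is_file():
            yield {
                "month": path.parent.parent.name,
                "outlet": path.parent.name,
                "text": path.read_text(encoding="utf-8", errors="replace"),
            }


def count_matches_per_outlet(articles, regex):
    """Count regex matches per outlet, holding only one article in memory at a time."""
    compiled = re.compile(regex)
    counts = Counter()
    for article in articles:
        counts[article["outlet"]] += sum(1 for _ in compiled.finditer(article["text"]))
    return counts
```

Because `iter_articles` is a generator, `count_matches_per_outlet(iter_articles(data_dir), r"[Tt]rump")` never holds more than one article text at once, however large the dataset is. The default `"*"` for `month` and `outlet` acts as a wildcard, so you can narrow the selection to one outlet or one month when needed.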