# Assignment 9

In this assignment you will implement

 * the precision score
 * a text classifier for German parliament speeches

## Assignment 01

Implement a function ``assignment_01`` that computes the precision of binary predictions:

$${\text{precision}}={\frac {|\{{\text{relevant instances}}\}\cap \{{\text{predicted instances}}\}|}{|\{{\text{predicted instances}}\}|}}$$


The function should expect the true and predicted binary categories as NumPy vectors, i.e. NumPy arrays with a single axis such as ``np.array([1,0])``, where 1 stands for the positive class and 0 for the negative class. Make sure that the function always returns a number and never NaN, even when no instance is predicted as positive.
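The NaN requirement matters because the denominator of the precision formula is zero whenever no instance is predicted as positive. The following is a minimal sketch of that edge case; the variable names and the toy arrays are illustrative assumptions, not part of the assignment template.

```python
import numpy as np

# Hypothetical toy input: nothing is predicted as positive
y_true = np.array([1, 1, 0])
y_predicted = np.array([0, 0, 0])

# Numerator and denominator of the precision formula
true_positives = np.sum((y_true == 1) & (y_predicted == 1))
predicted_positives = np.sum(y_predicted == 1)

# Dividing two NumPy zeros yields NaN (NumPy also emits a warning, silenced here)
with np.errstate(invalid="ignore", divide="ignore"):
    naive_precision = true_positives / predicted_positives

# Guarding the division returns a plain number instead of NaN
safe_precision = (
    true_positives / predicted_positives if predicted_positives > 0 else 0.0
)
```

Returning 0 in the undefined case is a convention chosen here to match the first assertion in the cell below; other conventions are possible.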

In [37]:
import numpy as np

def assignment_01(y_true, y_predicted):
    """Return the precision of binary predictions (1 = positive, 0 = negative)."""
    y_true = np.array(y_true)
    y_predicted = np.array(y_predicted)
    
    # Ensure that both arrays have the same shape
    if y_true.shape != y_predicted.shape:
        raise ValueError("Input arrays must have the same shape.")
    
    # Calculate the intersection of relevant and predicted instances
    intersection = np.sum((y_true == 1) & (y_predicted == 1))
    
    # Calculate the number of predicted instances
    num_predicted = np.sum(y_predicted == 1)
    
    # Precision is undefined (0/0) when nothing is predicted as positive;
    # return 0 in that case so that the result is never NaN
    precision = intersection / num_predicted if num_predicted != 0 else 0
    
    # Convert the NumPy scalar to a plain Python float
    return float(precision)

assert assignment_01(np.array([1,1,0]),np.array([0,0,0])) == 0
assert assignment_01(np.array([1,1,0]),np.array([1,1,0])) == 1
assert assignment_01(np.array([1,1,0]),np.array([1,0,0])) == 1
assert assignment_01(np.array([1,1,0]),np.array([0,1,1])) == .5

## Assignment 02

In the 17th Bundestag, elected in 2009, the ruling parties were CDU/CSU and FDP; in the 18th Bundestag, elected in 2013, the ruling parties were CDU/CSU and SPD. Download the [parliament speeches](https://www.dropbox.com/s/1nlbfehnrwwa2zj/bundestags_parlamentsprotokolle.csv.gzip?dl=1) and compute a new target variable 'government' that is true if the speaker's party was in the ruling coalition at the time of the speech.

Write a function ``assignment_02`` that preprocesses the data and trains a text classification pipeline that predicts whether a speech was given by a member of a governing party. Train the pipeline on speeches from the 17th Bundestag and test it on held-out data from the 17th Bundestag as well as on data from the 18th Bundestag.
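The labelling rule can be illustrated on a few made-up rows. This is only a sketch: it assumes the data has a legislative-period column named `wahlperiode` and a party column named `partei` with lowercase party labels, and the `RULING` dictionary and the toy DataFrame are hypothetical helpers introduced for illustration.

```python
import pandas as pd

# Ruling coalitions per legislative period, as stated in the task
# (lowercase party labels are an assumption about the data)
RULING = {17: {"cducsu", "fdp"}, 18: {"cducsu", "spd"}}

# Hypothetical toy rows; the real data has one row per speech
toy = pd.DataFrame({
    "wahlperiode": [17, 17, 18, 18],
    "partei": ["fdp", "spd", "fdp", "spd"],
})

# A speech counts as 'government' if its party ruled in that period
toy["government"] = [
    party in RULING[period]
    for period, party in zip(toy["wahlperiode"], toy["partei"])
]
```

Note that the same party can switch label between periods: in this toy example FDP is in government in period 17 but not in 18, and SPD the other way around, which is what makes the evaluation on the 18th Bundestag a harder test.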

In [47]:
import os, gzip
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import urllib.request
from sklearn.metrics import classification_report

DATADIR = "data"

def load_data():
    # Create the data directory if it does not exist yet
    os.makedirs(DATADIR, exist_ok=True)

    file_name = os.path.join(DATADIR, 'bundestags_parlamentsprotokolle.csv.gzip')
    if not os.path.exists(file_name):
        url_data = 'https://www.dropbox.com/s/1nlbfehnrwwa2zj/bundestags_parlamentsprotokolle.csv.gzip?dl=1'
        urllib.request.urlretrieve(url_data, file_name)

    # Read the gzip-compressed CSV and shuffle the rows
    df = pd.read_csv(gzip.open(file_name), index_col=0).sample(frac=1)
    # Label each speech by whether its party was in the ruling coalition of that period
    df.loc[df.wahlperiode==17,'government'] = df[df.wahlperiode==17].partei.isin(['cducsu','fdp']).astype(str)
    df.loc[df.wahlperiode==18,'government'] = df[df.wahlperiode==18].partei.isin(['cducsu','spd']).astype(str)
    
    return df


def assignment_02():
    df = load_data()
    
    # Put some data aside for model evaluation
    X, y = df.loc[df.wahlperiode==17,'text'], df.loc[df.wahlperiode==17,'government']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=5912)
    
    # Create a pipeline with a TfidfVectorizer and SGDClassifier
    text_clf = Pipeline([
        ('vect', TfidfVectorizer()),
        ('clf', SGDClassifier())
    ])
    
    # Hyperparameters for grid search
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
        'clf__alpha': (np.logspace(-5, 2, 5)).tolist()
    }
    
    # Perform grid search to find the best hyperparameters
    clf = GridSearchCV(text_clf, parameters, cv=2, n_jobs=-1,verbose=0)
    
    # Train the model on the training set
    clf.fit(X_train, y_train)
    
    print("*"*80 + "\nEvaluation on 17th Bundestag held-out data")
    print(classification_report(y_test, clf.predict(X_test)))

    predictions = clf.predict(df.loc[df.wahlperiode==18,'text'])
    print("*"*80 + "\nEvaluation on 18th Bundestag held-out data")
    print(classification_report(df.loc[df.wahlperiode==18,'government'], predictions))

assignment_02()

********************************************************************************
Evaluation on 17th Bundestag held-out data
              precision    recall  f1-score   support

       False       0.85      0.87      0.86      3190
        True       0.84      0.82      0.83      2722

    accuracy                           0.84      5912
   macro avg       0.84      0.84      0.84      5912
weighted avg       0.84      0.84      0.84      5912

********************************************************************************
Evaluation on 18th Bundestag held-out data
              precision    recall  f1-score   support

       False       0.56      0.91      0.70      6009
        True       0.95      0.70      0.81     14025

    accuracy                           0.76     20034
   macro avg       0.76      0.80      0.75     20034
weighted avg       0.83      0.76      0.77     20034

