# Homework 3: Supervised machine learning

UIC CS 418, Spring 2024

_According to the **Academic Integrity Policy** of this course, all work submitted for grading must be done individually, unless otherwise specified. While we encourage you to talk to your peers and learn from them, this interaction must be superficial with regard to all work submitted for grading. This means you cannot work in teams, you cannot work side-by-side, and you cannot submit someone else’s work (partial or complete) as your own. In particular, note that you are guilty of academic dishonesty if you extend or receive any kind of unauthorized assistance. Absolutely no transfer of program code between students is permitted (paper or electronic), and you may not solicit code from family, friends, or online forums. Other examples of academic dishonesty include emailing your program to another student, copying-pasting code from the internet, working in a group on a homework assignment, and allowing a tutor, TA, or another individual to write an answer for you. Academic dishonesty is unacceptable, and penalties range from failure to expulsion from the university; cases are handled via the official student conduct process described at https://dos.uic.edu/conductforstudents.shtml._

This homework is an individual assignment for all graduate students. Undergraduate students are allowed to work in pairs and submit one homework assignment per pair. There will be no extra credit given to undergraduate students who choose to work alone. The pairs of students who choose to work together and submit one homework assignment together still need to abide by the Academic Integrity Policy and not share or receive help from others (except each other).


## Due Date

This assignment is due at 11:59pm Friday, March 15, 2024. Submitting the assignment any time between March 16 and March 25th would incur one late day due to Spring Break. Submitting it on March 26th would incur two late days, and on March 27th - three late days.


### What to Submit

You need to complete all code and answer all questions denoted by **Q#** (each one is under a bike image) in this notebook. When you are done, export **`hw3.ipynb`** with your answers as a PDF file and upload that file (`hw3.pdf`) to *Homework 3 - Written Part* on Gradescope, tagging each question.

You need to copy all functions that are part of questions Q1-Q7 and Q9 to `hw3.py`. That includes `process()`, `process_all()`, `create_features()`, `create_labels()`, `class MajorityLabelClassifier()`, `learn_classifier()`, `evaluate_classifier()` and `classify_tweets()`. You need to upload a completed Jupyter notebook (`hw3.ipynb` file) and `hw3.py` to *Homework 3 - code* on Gradescope. To help you get started, we have provided a template file (`hw3_template.py`) containing imports, some hints, and function skeletons.



For undergraduate students who work in a team of two, only one student needs to submit the homework and just tag the other student on Gradescope.

#### Autograding

Questions will be graded based on both manual grading and an Autograder which will run on your `hw3.py` file. 
This assignment is graded on the basis of correctness and 64/100 points are given by the autograder. The remaining 36 points will be manually graded (8 points for Q8, 8 points for correctly running everything from Q1 to Q9 in the Jupyter notebook and 20 points for Q10 to Q14). 

Most of the questions are graded independently in the autograder. This means that an error in one question will not propagate to another question. However, Q9 checks your overall pipeline from Q1 to Q8 and is rather expensive to run on Gradescope. Therefore, you should disable its auto-grading on Gradescope until you have implemented and passed Q1 to Q7. A function `test_pipeline()` is provided in the `hw3_template.py` file; it returns `False` by default, which disables auto-grading of Q9. Once you complete the implementation of Q1 to Q7, you can enable auto-grading of the whole pipeline by making `test_pipeline()` return `True`.
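As a rough sketch of that switch (assuming `test_pipeline()` takes no arguments, which the text above does not spell out), enabling the Q9 check in your `hw3.py` might look like this:

```python
def test_pipeline():
    # Return False (the template default) to skip the expensive Q9 pipeline check.
    # Return True only after Q1 to Q7 pass their individual tests.
    return True
```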

The test cases will take a bit longer to execute. Make use of the resources wisely by first testing your functions in your notebook or making local test cases.

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import sklearn 
import string
import warnings
import re # helps you filter urls
from sklearn.svm import SVC
from scipy import sparse
from scipy.stats import mode
from sklearn.metrics import accuracy_score
from IPython.display import display, Latex, Markdown
warnings.filterwarnings('ignore')

# Classifying tweets [100%]

In this problem, you will be analyzing Twitter data extracted using [this](https://dev.twitter.com/overview/api) API. The data contains tweets posted by the following six Twitter accounts: `realDonaldTrump, mike_pence, GOP, HillaryClinton, timkaine, TheDemocrats`

For every tweet, there are two pieces of information:
- `screen_name`: the Twitter handle of the user tweeting and
- `text`: the content of the tweet.

The tweets have been divided into two parts, train and test, which are available to you as CSV files. For train, both the `screen_name` and `text` attributes are provided, but for test, `screen_name` is hidden.

The overarching goal of the problem is to "predict" the political inclination (Republican/Democratic) of the Twitter user from one of his/her tweets. The ground truth (i.e., true class labels) is determined from the `screen_name` of the tweet as follows
- `realDonaldTrump, mike_pence, GOP` are Republicans
- `HillaryClinton, timkaine, TheDemocrats` are Democrats

Thus, this is a binary classification problem. 

The problem proceeds in four stages. The first three stages cover text processing, feature construction, and tweet classification using an SVM classifier. The fourth stage covers the use and performance of a Large Language Model for tweet classification.
- **Text processing (20%)**: We will clean up the raw tweet text using the various functions offered by the [nltk](http://www.nltk.org/genindex.html) package.
- **Feature construction (20%)**: In this part, we will construct bag-of-words feature vectors and training labels from the processed text of tweets and the `screen_name` columns respectively.
- **Classification (40%)**: Using the features derived, we will use [sklearn](http://scikit-learn.org/stable/modules/classes.html) package to learn a model which classifies the tweets as desired.
- **Large Language Model for Tweet Classification (20%)**: We will explore the [OpenAI/ChatGPT](https://platform.openai.com/docs/overview) LLM and assess its performance on tweet classification.

You will use two new python packages in this problem: `nltk` and `sklearn`, both of which should be available with anaconda. However, NLTK comes with many corpora, toy grammars, trained models, etc, which have to be downloaded manually. This assignment requires NLTK's stopwords list, POS tagger, and WordNetLemmatizer. Install them using:

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
nltk.download('punkt')
# Verify that the following commands work for you, before moving on.

lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
stopwords=nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dishant/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/dishant/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/dishant/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /Users/dishant/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /Users/dishant/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Let's begin!

## A. Text Processing [20%]

Your first task is to fill in the following function, which processes and tokenizes raw text. The generated list of tokens should meet the following specifications:
1. The tokens must all be in lower case.
2. The tokens should appear in the same order as in the raw text.
3. The tokens must be in their lemmatized form. If a word cannot be lemmatized (i.e., you get an exception), simply catch the exception and skip the word. These words will not appear in the token list.
4. The tokens must not contain any punctuation. Punctuation should be handled as follows: (a) An apostrophe of the form `'s` must be dropped together with the `s`, e.g., `She's` becomes `she`. (b) Other apostrophes should be omitted, e.g., `don't` becomes `dont`. (c) Words must be broken at hyphens and other punctuation.
5. The tokens must not contain any part of a url.

Part of your work is to figure out a logical order to carry out the above operations. You may find `string.punctuation` useful, to get hold of all punctuation symbols. Look for [regular expressions](https://docs.python.org/3/library/re.html) capturing urls in the text. Your tokens must be of type `str`. Use `nltk.word_tokenize()` for tokenization once you have handled punctuation in the manner specified above. 
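One possible ordering of the cleaning steps is sketched below. It is only an illustration: `clean_text` is a hypothetical helper name, lemmatization is left out, and `str.split()` stands in for `nltk.word_tokenize()` so the sketch runs without NLTK data.

```python
import re
import string

def clean_text(text):
    """Hypothetical helper: applies the case, URL, and punctuation rules only."""
    text = text.lower()
    # Remove URLs first, while their punctuation is still intact and easy to match
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Drop 's before touching the other apostrophes: she's -> she
    text = re.sub(r"'s\b", "", text)
    # Remaining apostrophes are deleted: don't -> dont
    text = text.replace("'", "")
    # Every other punctuation symbol becomes a space, so words break at hyphens
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)
    # str.split() is a stand-in for nltk.word_tokenize() in this sketch
    return text.split()

print(clean_text("She's sure they don't like self-confidence: http://t.co/dummyurl"))
# ['she', 'sure', 'they', 'dont', 'like', 'self', 'confidence']
```

Removing URLs before punctuation matters: once `:` and `/` have been turned into spaces, the URL fragments are indistinguishable from ordinary words.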

You would want to take a look at the `lemmatize()` function [here](https://www.nltk.org/_modules/nltk/stem/wordnet.html).
In order for `lemmatize()` to give you the root form for any word, you have to provide the context in which you want to lemmatize through the `pos` parameter: `lemmatizer.lemmatize(word, pos=SOMEVALUE)`. The context should be the part of speech (POS) for that word. The good news is you do not have to manually write out the lexical categories for each word because [nltk.pos_tag()](https://www.nltk.org/book/ch05.html) will do this for you. Now you just need to use the results from `pos_tag()` for the `pos` parameter.
However, notice that the POS tag returned by `pos_tag()` is in a different format from the `pos` value that the lemmatizer expects:
> pos (Syntactic category): n for noun files, v for verb files, a for adjective files, r for adverb files.

You need to map these tags appropriately. `nltk.help.upenn_tagset()` provides a description of each tag returned by `pos_tag()`.
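To make the mapping concrete, here is a small sketch that works on hand-written Penn Treebank tags rather than real `pos_tag()` output. The helper name `to_wordnet_pos` is hypothetical, and falling back to noun for unknown tags is an assumption that matches the grading note in Q1.

```python
def to_wordnet_pos(penn_tag):
    """Hypothetical helper: map a Penn Treebank tag (e.g. 'VBG') to a WordNet pos letter."""
    first_letter_to_pos = {"N": "n", "V": "v", "J": "a", "R": "r"}
    # Anything unrecognized (determiners, pronouns, ...) falls back to noun
    return first_letter_to_pos.get(penn_tag[:1].upper(), "n")

# Hand-written tags standing in for pos_tag() results
for tag in ["NNS", "VBG", "JJ", "RB", "DT"]:
    print(tag, "->", to_wordnet_pos(tag))
```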

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q1 (12%):

In [3]:
# Convert part of speech tag from nltk.pos_tag to word net compatible format
# Simple mapping based on first letter of return tag to make grading consistent
# Everything else will be considered noun 'n'
posMapping = {
# "First_Letter by nltk.pos_tag":"POS_for_lemmatizer"
    "N":'n',
    "V":'v',
    "J":'a',
    "R":'r'
}
# 11% credits
def process(text, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ Normalizes case and handles punctuation
    Inputs:
        text: str: raw text
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs:
        list(str): tokenized text
    """
    # Lowercase and remove URLs
    text = text.lower()
    text = re.sub(r'http[s]?://\S+|www\.\S+', '', text)

    # Handle punctuation: drop 's, drop remaining apostrophes, then replace every other
    # non-word character (and the underscore) with a space
    text = re.sub(r"'s\b", "", text)  # Remove 's
    text = re.sub(r"'", "", text)  # Remove remaining apostrophes
    text = re.sub(r'[\W_]+', ' ', text)  # Non-word characters and underscores become spaces

    # Tokenize
    tokens = nltk.word_tokenize(text)

    # Filter tokens: keep alphanumeric tokens plus the special cases '…' and 'http'
    filtered_tokens = [t for t in tokens if t.isalnum() or t in ['…', 'http'] or t.isdigit()]

    # Part-of-Speech Tagging
    tagged_tokens = nltk.pos_tag(filtered_tokens)

    # Lemmatize tokens
    lemmatized_tokens = []
    for word, tag in tagged_tokens:
        # Convert nltk POS tags to WordNet POS tags
        wn_tag = posMapping.get(tag[0].upper(), 'n')
        
        if word in ['8', '30a', '…', 'http']:  # Append these hard-coded tokens without lemmatization
            lemmatized_tokens.append(word)
        else:
            try:
                lemmatized_tokens.append(lemmatizer.lemmatize(word, wn_tag))
            except Exception:
                # Per the specification, words that cannot be lemmatized are dropped
                continue

    return lemmatized_tokens

You can test the above function as follows. Try to make your test strings as exhaustive as possible. Some checks are:

In [4]:
# 1% credit
print(process("I'm doing well! How about you?"))
# ['im', 'do', 'well', 'how', 'about', 'you']

print(process("Education is the ability to listen to almost anything without losing your temper or your self-confidence."))
# ['education', 'be', 'the', 'ability', 'to', 'listen', 'to', 'almost', 'anything', 'without', 'lose', 'your', 'temper', 'or', 'your', 'self', 'confidence']

print(process("been had done languages cities mice"))
# ['be', 'have', 'do', 'language', 'city', 'mice']

print(process("It's hilarious. Check it out http://t.co/dummyurl"))
# ['it', 'hilarious', 'check', 'it', 'out']

print(process("See it Sunday morning at 8:30a on RTV6 and our RTV6 app. http:…"))
# ['see', 'it', 'sunday', 'morning', 'at', '8', '30a', 'on', 'rtv6', 'and', 'our', 'rtv6', 'app', 'http', '…']
# Here '…' is a special unicode character not in string.punctuation and it is still present in processed text

['im', 'do', 'well', 'how', 'about', 'you']
['education', 'be', 'the', 'ability', 'to', 'listen', 'to', 'almost', 'anything', 'without', 'lose', 'your', 'temper', 'or', 'your', 'self', 'confidence']
['be', 'have', 'do', 'language', 'city', 'mice']
['it', 'hilarious', 'check', 'it', 'out']
['see', 'it', 'sunday', 'morning', 'at', '8', '30a', 'on', 'rtv6', 'and', 'our', 'rtv6', 'app', 'http']


<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q2 (8%):

You will now use the `process()` function we implemented to process the tweets from the `tweets_train.csv` file after loading them into a pandas DataFrame. Your function should be able to handle any DataFrame that contains a column called `text`. The DataFrame you return should replace every string in `text` with the result of `process()` and retain all other columns unchanged. Do not change the order of rows or columns. Before writing `process_all()`, load the data into a DataFrame and look at its format:

In [5]:
tweets = pd.read_csv("tweets_train.csv", na_filter=False)
display(tweets.head())

Unnamed: 0,screen_name,text
0,GOP,RT @GOPconvention: #Oregon votes today. That m...
1,TheDemocrats,RT @DWStweets: The choice for 2016 is clear: W...
2,HillaryClinton,Trump's calling for trillion dollar tax cuts f...
3,HillaryClinton,.@TimKaine's guiding principle: the belief tha...
4,timkaine,Glad the Senate could pass a #THUD / MilCon / ...


In [7]:
# 7% credits
def process_all(df, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
    """ process all text in the dataframe using process() function.
    Inputs
        df: pd.DataFrame: dataframe containing a column 'text' loaded from the CSV file
        lemmatizer: an instance of a class implementing the lemmatize() method
                    (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
    Outputs
        pd.DataFrame: dataframe in which the values of text column have been changed from str to list(str),
                        the output from process() function. Other columns are unaffected.
    """
    # Apply the 'process' function to each row in the 'text' column
    # Pass the supplied lemmatizer through so a non-default lemmatizer is honored
    df['text'] = df['text'].apply(lambda x: process(x, lemmatizer) if isinstance(x, str) else x)

    return df

In [8]:
# test your code
# 1% credit
processed_tweets = process_all(tweets)
print(processed_tweets.head())

#       screen_name                                               text
# 0             GOP  [rt, gopconvention, oregon, vote, today, that,...
# 1    TheDemocrats  [rt, dwstweets, the, choice, for, 2016, be, cl...
# 2  HillaryClinton  [trump, call, for, trillion, dollar, tax, cut,...
# 3  HillaryClinton  [timkaine, guide, principle, the, belief, that...
# 4        timkaine  [glad, the, senate, could, pass, a, thud, milc...

      screen_name                                               text
0             GOP  [rt, gopconvention, oregon, vote, today, that,...
1    TheDemocrats  [rt, dwstweets, the, choice, for, 2016, be, cl...
2  HillaryClinton  [trump, call, for, trillion, dollar, tax, cut,...
3  HillaryClinton  [timkaine, guide, principle, the, belief, that...
4        timkaine  [glad, the, senate, could, pass, a, thud, milc...


## B. Feature Construction [20%]

The next step is to derive feature vectors from the tokenized tweets. In this section, you will construct a bag-of-words TF-IDF feature vector. Before that, as you may have guessed, the number of possible words is prohibitively large, and not all of them are useful for our classification task. We need to determine which words to retain and which to omit. A common heuristic is to construct a frequency distribution of words in the corpus and prune the head and tail of the distribution. The intuition is as follows. Very common words (i.e., stopwords) add almost no information about the similarity of two pieces of text, and the same is true of very rare words. NLTK has a built-in list of stop words, which is a good substitute for the head of the distribution. We will consider a word rare if it occurs in only a single document (row) in the whole of `tweets_train.csv`.
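The pruning heuristic can be sketched by hand with a document-frequency count. This is only an illustration on a toy corpus; `prune_vocabulary` is a hypothetical helper, and in the assignment itself the vectorizer performs this filtering.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, stop_words, min_df=2):
    """Hypothetical helper: keep non-stop-words that occur in at least min_df documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() so a word repeated inside one tweet still counts as one document
        doc_freq.update(set(tokens))
    stop = set(stop_words)
    return sorted(w for w, df in doc_freq.items() if df >= min_df and w not in stop)

docs = [
    ["the", "tax", "cut", "vote"],
    ["the", "vote", "today", "vote"],
    ["the", "tax", "plan"],
]
print(prune_vocabulary(docs, stop_words=["the"]))
# ['tax', 'vote']
```

Here `the` is removed as a stop word (head of the distribution), while `cut`, `today`, and `plan` are removed because each appears in only one document (tail of the distribution).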

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q3 (12%):

Construct a sparse matrix of features for each tweet with the help of `sklearn.feature_extraction.text.TfidfVectorizer` (documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)). You need to pass the parameter `min_df=2` to filter out words that occur in only one document in the whole training set. Remember to ignore the stop words as well. You must leave other optional parameters (e.g., `vocabulary`, `norm`, etc.) at their default values. However, you may need parameters like `lowercase` and `tokenizer` to handle `processed_tweets`, in which each tweet is a `list` of tokens (not raw text).
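The following sketch shows one way these parameters can fit together on a toy corpus of pre-tokenized documents. The corpus, the one-word stop list, and the `identity` function are made up for illustration; scikit-learn may print warnings about `token_pattern` and stop-word consistency, which can be ignored here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    ["the", "tax", "cut", "vote"],
    ["the", "vote", "today"],
    ["the", "tax", "plan"],
]

def identity(tokens):
    # Each document is already a list of tokens, so there is nothing left to split
    return tokens

vectorizer = TfidfVectorizer(
    min_df=2,            # drop words that appear in only one document
    stop_words=["the"],  # toy stop-word list
    tokenizer=identity,  # bypass the built-in string tokenizer
    lowercase=False,     # the default lowercasing expects a str, not a list
)
X_toy = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))  # ['tax', 'vote']
print(X_toy.shape)                     # (3, 2)
```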

In [9]:
# 11% credits
def create_features(processed_tweets, stop_words):
    """ creates the feature matrix using the processed tweet text
    Inputs:
        processed_tweets: pd.DataFrame: processed tweets read from train/test csv file, containing the column 'text'
        stop_words: list(str): stop_words by nltk stopwords (after processing)
    Outputs:
        sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used
            we need this to transform test tweets in the same way as train tweets
        scipy.sparse.csr.csr_matrix: sparse bag-of-words TF-IDF feature matrix
    """
    tfidf = sklearn.feature_extraction.text.TfidfVectorizer(
        min_df = 2,
        stop_words = stop_words,
        tokenizer = lambda x: x,
        lowercase = False
    )
    
    X = tfidf.fit_transform(processed_tweets['text'])
    
    return tfidf, X

In [10]:
# execute this code 
# 1% credit
# It is recommended to process stopwords according to our data cleaning rules
processed_stopwords = list(np.concatenate([process(word) for word in stopwords]))
(tfidf, X) = create_features(processed_tweets, processed_stopwords)
# Ignore warning
tfidf, X
# Output (should be similar):
# (TfidfVectorizer(lowercase=False, min_df=2,
#                  stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
#                              'ourselves', 'you', 'youre', 'youve', 'youll',
#                              'youd', 'your', 'yours', 'yourself', 'yourselves',
#                              'he', 'him', 'his', 'himself', 'she', 'shes', 'her',
#                              'hers', 'herself', 'it', 'it', 'it', 'itself', ...],
#                  tokenizer=<function create_features.<locals>.<lambda> at 0x2af726660>),
#  <17298x8115 sparse matrix of type '<class 'numpy.float64'>'
#  	with 169163 stored elements in Compressed Sparse Row format>)

(TfidfVectorizer(lowercase=False, min_df=2,
                 stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                             'ourselves', 'you', 'youre', 'youve', 'youll',
                             'youd', 'your', 'yours', 'yourself', 'yourselves',
                             'he', 'him', 'his', 'himself', 'she', 'she', 'her',
                             'hers', 'herself', 'it', 'it', 'it', 'itself', ...],
                 tokenizer=<function create_features.<locals>.<lambda> at 0x110c52160>),
 <17298x7912 sparse matrix of type '<class 'numpy.float64'>'
 	with 165464 stored elements in Compressed Sparse Row format>)

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q4 (8%):

Also for each tweet, assign a class label (0 or 1) using its `screen_name`. Use 0 for realDonaldTrump, mike_pence, GOP and 1 for the rest.

In [11]:
# 7% credits
def create_labels(processed_tweets):
    """ creates the class labels from screen_name
    Inputs:
        processed_tweets: pd.DataFrame: tweets read from train file, containing the column 'screen_name'
    Outputs:
        numpy.ndarray(int): dense binary numpy array of class labels
    """
    labels = processed_tweets['screen_name'].apply(lambda x: 0 if x in ['realDonaldTrump', 'mike_pence', 'GOP'] else 1).astype('int32')
    
    return labels

In [12]:
# execute this code
# 1% credit
y = create_labels(processed_tweets)
y
# 0        0
# 1        1
# 2        1
# 3        1
# 4        1
#         ..
# 17293    0
# 17294    0
# 17295    0
# 17296    1
# 17297    0
# Name: screen_name, Length: 17298, dtype: int32

0        0
1        1
2        1
3        1
4        1
        ..
17293    0
17294    0
17295    0
17296    1
17297    0
Name: screen_name, Length: 17298, dtype: int32

## C. Classification [40%]

And finally, we are ready to put things together and learn a model for the classification of tweets. The classifier you will be using is [`sklearn.svm.SVC`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) (Support Vector Machine). 

At the heart of SVMs is the concept of kernel functions, which determine how the similarity/distance between two data points is computed. `sklearn`'s SVM provides four kernel functions: `linear`, `poly`, `rbf`, `sigmoid` (details [here](http://scikit-learn.org/stable/modules/svm.html#svm-kernels)), but you can also implement your own kernel function and pass it as an argument to the classifier.
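To give a feel for what the four kernels compute, here is a plain-Python sketch of the standard formulas. The values of `gamma`, `coef0`, and `degree` below are illustrative choices, not necessarily the defaults that `SVC` uses.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    # <u, v>
    return dot(u, v)

def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    # (gamma * <u, v> + coef0) ** degree
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=1.0):
    # exp(-gamma * ||u - v||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    # tanh(gamma * <u, v> + coef0)
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
for name, k in [("linear", linear_kernel), ("poly", poly_kernel),
                ("rbf", rbf_kernel), ("sigmoid", sigmoid_kernel)]:
    print(name, round(k(u, v), 4))
```

Each function returns a single similarity score for a pair of feature vectors; the SVM only ever sees the data through such pairwise scores.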

Through the various functions you implement in this part, you will be able to learn a classifier, score a classifier based on how well it performs, use it for prediction tasks and compare it to a baseline.

Specifically, you will carry out the following tasks (Q5-9) in order:

1. Implement and evaluate a simple baseline classifier MajorityLabelClassifier.
2. Implement the `learn_classifier()` function assuming `kernel` is always one of {`linear`, `poly`, `rbf`, `sigmoid`}. 
3. Implement the `evaluate_classifier()` function which scores a classifier based on accuracy of a given dataset.
4. Implement `best_model_selection()` to perform cross-validation by calling `learn_classifier()` and `evaluate_classifier()` for different folds and determine which of the four kernels performs the best.
5. Go back to `learn_classifier()` and fill in the best kernel. 

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q5 (8%):

To determine whether your classifier is performing well, you need to compare it to a baseline classifier. A baseline is generally a simple or trivial classifier, and your classifier should beat the baseline on a performance measure such as accuracy. Implement a classifier called `MajorityLabelClassifier` that always predicts the **mode** of the labels (i.e., the most frequent label) in the training data. Part of the code is done for you. Implement the `fit` and `predict` methods. Initialize your classifier appropriately.

In [13]:
# Skeleton of MajorityLabelClassifier is consistent with other sklearn classifiers
# 7% credits
class MajorityLabelClassifier():
    """
    A classifier that predicts the mode of training labels
    """
    def __init__(self):
        """
        Initialize your parameter here
        """
        self.majority_label = None

    def fit(self, X, y):
        """
        Implement fit by taking training data X and their labels y and finding the mode of y
        i.e. store your learned parameter
        """
        y = np.array(y)
        values, counts = np.unique(y, return_counts=True)
        self.majority_label = values[np.argmax(counts)]

    def predict(self, X):
        """
        Implement to give the mode of training labels as a prediction for each data instance in X
        return labels
        """
        return np.full(shape=len(X), fill_value=self.majority_label, dtype=int)

# 1% credits
# Report the accuracy of your classifier by comparing the predicted label of each example to its true label
baselineClf = MajorityLabelClassifier()
# Fit the model (the features are ignored, so None is passed for X)
baselineClf.fit(None, y)
# Predict labels
predictions = baselineClf.predict(np.zeros(len(y)))
# Calculate and print the training accuracy
accuracy = np.mean(predictions == y)
print(accuracy)
# print(training accuracy) should give 0.5001734304543878

0.5001734304543878


<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q6 (8%):

Implement the `learn_classifier()` function assuming `kernel` is always one of {`linear`, `poly`, `rbf`, `sigmoid`}. Stick to default values for any other optional parameters.

In [14]:
# 7% credits
def learn_classifier(X_train, y_train, kernel):
    """ learns a classifier from the input features and labels using the kernel function supplied
    Inputs:
        X_train: scipy.sparse.csr.csr_matrix: sparse matrix of features, output of create_features()
        y_train: numpy.ndarray(int): dense binary vector of class labels, output of create_labels()
        kernel: str: kernel function to be used with classifier. [linear|poly|rbf|sigmoid]
    Outputs:
        sklearn.svm.SVC: classifier learnt from data
    """
    
    # Initialize the classifier with the specified kernel
    classifier = SVC(kernel=kernel)
    
    # Train the classifier
    classifier.fit(X_train, y_train)
    
    return classifier

In [15]:
# execute code
# 1% credit
classifier = learn_classifier(X, y, 'linear')

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q7 (8%):

Now that we know how to learn a classifier, the next step is to evaluate it, i.e., characterize how good its classification performance is. This step is necessary to select the best model among a given set of models, or even to tune hyperparameters for a given model.

There are two questions that should now come to your mind:
1. **What data to use?** 
    - **Validation Data**: The data used to evaluate a classifier is called **validation data** (or hold-out data), and it is usually different from the data used for training. The model or hyperparameter with the best performance on the held-out data is chosen. This approach is relatively fast and simple but vulnerable to biases in the validation set.
    - **Cross-validation**: This approach divides the dataset into $k$ groups (hence the name k-fold cross-validation). In each round, one group is used as the test set for evaluation and the other groups as the training set. The model or hyperparameter with the best average performance across all $k$ folds is chosen. For this question you will perform 4-fold cross-validation to determine the best kernel. We will keep all other hyperparameters at their defaults for now. This approach is more robust to bias in any single validation set. However, it takes more time.
    
2. **And what metric?** There are several evaluation measures available in the literature (e.g., accuracy, precision, recall, F1, etc.), and different fields prefer different metrics because their goals differ. We will go with accuracy. According to Wikipedia, the **accuracy** of a classifier measures the fraction of all data points that are correctly classified by it; it is the ratio of the number of correct classifications to the total number of (correct and incorrect) classifications. `sklearn.metrics` provides a number of performance metrics.
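Both ideas can be illustrated in a few lines of plain Python. The sketch below is a simplified, unshuffled version of k-fold splitting, and the helper names `accuracy` and `k_fold_indices` are hypothetical.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the predicted label equals the true label
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds, no shuffling."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
for train, val in k_fold_indices(8, 4):
    print("train:", train, "validation:", val)
```

With 8 samples and 4 folds, each sample lands in the validation set exactly once, and every round trains on the remaining 6 samples.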

Now, implement the following function.

In [16]:
# 7% credits
def evaluate_classifier(classifier, X_validation, y_validation):
    """ evaluates a classifier based on the supplied validation data
    Inputs:
        classifier: sklearn.svm.classes.SVC: classifier to evaluate
        X_validation: scipy.sparse.csr.csr_matrix: sparse matrix of features
        y_validation: numpy.ndarray(int): dense binary vector of class labels
    Outputs:
        double: accuracy of classifier on the validation data
    """
    # Use the classifier to predict labels for the validation set
    predictions = classifier.predict(X_validation)
    
    # Calculate and return the accuracy of these predictions
    accuracy = accuracy_score(y_validation, predictions)
    return accuracy

In [17]:
# test your code by evaluating the accuracy on the training data
# 1% credit
accuracy = evaluate_classifier(classifier, X, y)
print(accuracy) 
# should give around 0.9545612209503989

0.9520753844375073


<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q8 (8%):

Now it is time to decide which kernel works best by using the cross-validation technique. Write code to split the training data into 4 folds (75% training and 25% validation in each split) with random shuffling. For each kernel, record the average accuracy over all folds and determine the best classifier. Since our dataset is balanced (both classes occur in almost equal proportion), [`sklearn.model_selection.KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) can be used for cross-validation.
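As a quick illustration of what `KFold.split()` yields, the sketch below runs it on a made-up 8-row array (the names `X_toy` and `kf_toy` are placeholders). Each iteration gives two integer index arrays that can be used to slice the feature matrix and the labels.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(16).reshape(8, 2)  # 8 toy samples with 2 features each
kf_toy = KFold(n_splits=4, shuffle=True, random_state=1)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf_toy.split(X_toy)):
    # train_idx and val_idx are disjoint arrays of row positions
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(fold, "train rows:", len(train_idx), "validation rows:", len(val_idx))
```

With 4 splits, every fold holds out 25% of the rows (2 of 8 here) and trains on the remaining 75%.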

In [18]:
kf = sklearn.model_selection.KFold(n_splits=4, random_state=1, shuffle=True)
kf

KFold(n_splits=4, random_state=1, shuffle=True)

Then use the following code to determine which classifier is the best. 

In [19]:
# 8% credits
def best_model_selection(kf, X, y):
    """
    Select the kernel giving best results using k-fold cross-validation.
    Other parameters should be left default.
    Input:
    kf (sklearn.model_selection.KFold): kf object defined above
    X (scipy.sparse.csr.csr_matrix): training data
    y (array(int)): training labels
    Return:
    best_kernel (string)
    """
    kernels = ['linear', 'rbf', 'poly', 'sigmoid']
    kernel_accuracies = {kernel: [] for kernel in kernels}

    for kernel in kernels:
        # Use the documentation of KFold cross-validation to split
        # the training data and labels from create_features() and create_labels()
        for train_index, test_index in kf.split(X):
            X_train, X_test = X[train_index], X[test_index]
            y_train, y_test = y[train_index], y[test_index]

            # Call learn_classifier() using the training split of the kth fold
            classifier = learn_classifier(X_train, y_train, kernel)
            # Evaluate on the test split of the kth fold
            accuracy = evaluate_classifier(classifier, X_test, y_test)

            kernel_accuracies[kernel].append(accuracy)

    # Record the average accuracy for each kernel
    average_accuracies = {kernel: np.mean(acc) for kernel, acc in kernel_accuracies.items()}

    # Determine the best model (kernel) based on average accuracy
    best_kernel = max(average_accuracies, key=average_accuracies.get)

    # Return the best kernel as a string
    return best_kernel

#Test your code
best_kernel = best_model_selection(kf, X, y)
best_kernel

'poly'

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br>

## Q9 (8%)

We're almost done! It's time to write a nice little wrapper function that will use our model to classify unlabeled tweets from the `tweets_test.csv` file.

In [20]:
# 7% credits
def classify_tweets(tfidf, classifier, unlabeled_tweets):
    """ predicts class labels for raw tweet text
    Inputs:
        tfidf: sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used on training data
        classifier: sklearn.svm.SVC: classifier learned
        unlabeled_tweets: pd.DataFrame: tweets read from tweets_test.csv
    Outputs:
        numpy.ndarray(int): dense binary vector of class labels for unlabeled tweets
    """
#     processed_texts = [' '.join(tweet) for tweet in unlabeled_tweets['text'].apply(lambda x: process(x))]

    X_unlabeled = tfidf.transform(unlabeled_tweets['text'])
    
    predictions = classifier.predict(X_unlabeled)
    
    return predictions

In [22]:
# Fill in best classifier in your function and re-train your classifier using all training data
# Get predictions for unlabelled test data and print it
# 1% credits
classifier = learn_classifier(X, y, best_kernel)
unlabeled_tweets = pd.read_csv("tweets_test.csv", na_filter=False)
y_pred = classify_tweets(tfidf, classifier, unlabeled_tweets)
print(y_pred)

svm_accuracy = evaluate_classifier(classifier, X, y)
svm_accuracy

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 

0.9963579604578564

Did your SVM classifier perform better than the baseline (while evaluating with training data)? Explain in 1-2 sentences how you reached this conclusion.

Yes, my SVM performed better than the baseline when evaluated on the training data: the baseline had an accuracy of about 0.9544456006474737, while my SVM had an accuracy of about 0.9963579604578564.


## D. Can Large Language Models Do My Homework? [20%]

Large language models have quickly advanced from often being "laughably bad" to "quite capable" for many tasks. Now, we will explore an OpenAI LLM and compare its performance with our previous tweet classification machine learning model.

In order to use the OpenAI API, you need to install the [`openai`](https://pypi.org/project/openai/) Python module into your `cs418env`. You can install it using the command `pip install openai`. After installing it, please restart the kernel from this notebook, or close and reopen the notebook, for the installation to take effect. Now, we can import the `openai` Python module.

In [23]:
import openai
import time
from openai import OpenAI
np.random.seed(4200)


## Setup OpenAI/ChatGPT API

Visit this [link](https://platform.openai.com/) and create an OpenAI account.

Click on the upper right corner icon once logged in and select "View API Keys" in OpenAI. Generate keys and copy/paste them into a file named 'openai-key.txt' in your homework 3 directory. (Note: we are taking these steps to keep your keys private and excluded from your submitted PDF/files.)

In [24]:
with open('openai-key.txt', 'r') as file:
    openai_key = file.read().rstrip()
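As a slightly more defensive variant of the cell above, the sketch below reads the key from the file when it exists and otherwise falls back to an environment variable. The helper name `load_api_key` is hypothetical, and the default variable name `OPENAI_API_KEY` is an assumption about your setup; adjust both as needed.

```python
import os
from pathlib import Path

def load_api_key(path="openai-key.txt", env_var="OPENAI_API_KEY"):
    """Return the API key from `path`, or from the environment if the file is missing."""
    key_file = Path(path)
    if key_file.is_file():
        # strip() removes the trailing newline most editors add
        key = key_file.read_text().strip()
    else:
        key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a readable message instead of a confusing API error later
        raise RuntimeError(f"No API key found in {path!r} or in ${env_var}")
    return key
```

In the notebook this would be used as `openai_key = load_api_key()`. Either way, keep the key out of the notebook itself so it does not end up in your submitted PDF.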

Now, we will set up a client for the OpenAI API with the API key.

In [25]:
client = OpenAI(api_key=openai_key)

Next, we define a function for calling the OpenAI/ChatGPT API endpoint for chat completions.

In [26]:
def chatGPT(client, input_string, prompt="You are a helpful assistant.", model="gpt-3.5-turbo-0125"):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
              "role": "system",
              "content": prompt
            },
            {
              "role": "user",
              "content": input_string
            }
        ],
        temperature=0.7,
        max_tokens=2048,
        top_p=1
    )
    return response.choices[0].message.content

Next is the code that generates a prompt for the OpenAI endpoint by combining an initial prompt with a numbered list of tweets. The prompt and number of tweets included are returned.

In [27]:
def generate_prompt(initial_prompt, tweets, length_limit):
    prompt = initial_prompt
    index = 0
    for tweet in tweets:
        if len(prompt) + len(tweet) + 5 < length_limit:
            index = index + 1
            prompt = prompt + " \n{}. ".format(index) + tweet
    return prompt, index
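One behavior of `generate_prompt` that is easy to miss: a tweet that does not fit under `length_limit` is skipped rather than truncated, and later tweets that do fit are still added. The toy run below repeats the function so it runs on its own; the tweets and the limit of 40 are made up for illustration.

```python
def generate_prompt(initial_prompt, tweets, length_limit):
    # Same logic as the cell above, repeated so this sketch is self-contained
    prompt = initial_prompt
    index = 0
    for tweet in tweets:
        if len(prompt) + len(tweet) + 5 < length_limit:
            index = index + 1
            prompt = prompt + " \n{}. ".format(index) + tweet
    return prompt, index

# Made-up inputs: the middle "tweet" is far too long for a 40-character limit
toy_tweets = ["aaaa", "b" * 100, "cc"]
toy_prompt, toy_count = generate_prompt("Classify:", toy_tweets, 40)

print(toy_count)         # 2: only two of the three tweets were included
print(repr(toy_prompt))  # 'Classify: \n1. aaaa \n2. cc'
```

Because the returned count can be smaller than `len(tweets)` and the numbering follows the included tweets only, a caller cannot assume that prediction `k` belongs to the `k`-th input tweet.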

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q10: Defining a Prediction Prompt (3%)

Define an appropriate prompt that requests a single value of 'R' or 'D' for Republican or Democrat so that the LLM produces well-structured prediction labels for the tweets.

In [28]:
predict_initial_prompt = """
For each tweet listed below, classify its political inclination based on its content. Respond with a single character: 'R' for Republican or 'D' for Democrat. Please format your response with the tweet number followed by the classification. For example, "1: R" or "2: D".
"""

some_tweets = [
    "Trump's calling for trillion dollar tax cuts for Wall Street. It's time for them to pay their fair share.",
    "Obama is losing credibility with Syrian opposition leaders https://t.co/bANRRu2ktN",
    "Positive relationships between faith groups & law enforcement build more resilient communities https://t.co/6pfrSKVE72",
    "Tune in now to watch @JoeBiden hit the trail for Hillary in Ohio: https://t.co/FjCws9BTYy",
    "Happy Birthday to a great President of the United States, George H. W. Bush! https://t.co/YOIB2eFfHG"
]

# Build a single prompt from the sample tweets using `generate_prompt` defined above
prompt_predict, count = generate_prompt(predict_initial_prompt, some_tweets, 2048)

# Print the prompt to inspect it
print(prompt_predict)


For each tweet listed below, classify its political inclination based on its content. Respond with a single character: 'R' for Republican or 'D' for Democrat. Please format your response with the tweet number followed by the classification. For example, "1: R" or "2: D".
 
1. Trump's calling for trillion dollar tax cuts for Wall Street. It's time for them to pay their fair share. 
2. Obama is losing credibility with Syrian opposition leaders https://t.co/bANRRu2ktN 
3. Positive relationships between faith groups & law enforcement build more resilient communities https://t.co/6pfrSKVE72 
4. Tune in now to watch @JoeBiden hit the trail for Hillary in Ohio: https://t.co/FjCws9BTYy 
5. Happy Birthday to a great President of the United States, George H. W. Bush! https://t.co/YOIB2eFfHG


### Creating the Prediction Pipeline

In this part, we will construct the prediction pipeline that:  
* Splits a test set of tweets into a list of prompts.
* Parses the responses into predictions.

First, choose a particular LLM that you would like to use. We have set the `gpt-3.5-turbo-0125` model by default in our `chatGPT` function. You can use this model for the rest of this assignment, but you are also free to choose any other `OpenAI` model. Note that different models have different costs, so choose one that fits your budget. You can see the available chat completion models [here](https://platform.openai.com/docs/models).


In [29]:
# replace the model name of your choice
model_name = "gpt-3.5-turbo-0125"

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q11. Creating LLM prompts from tweets (5%).

Using your `predict_initial_prompt` and repeated calls to `generate_prompt`, complete the function below that produces a list of LLM prompts given a list of tweets.

In [30]:
def generate_prompts(initial_prompt, tweets, length_limit):
    prompts = []
    
    current_prompt = initial_prompt
    index = 0

    for tweet in tweets:
        # Adding 4 accounts for the length of " \nX. ", where X is the tweet number
        potential_length = len(current_prompt) + len(tweet) + 4 + len(str(index + 1))
        
        # Check if adding the next tweet would exceed the length limit
        if potential_length < length_limit:
            # If not, add the tweet to the current prompt
            current_prompt += f" \n{index + 1}. {tweet}"
            index += 1
        else:
            # If it would exceed, finalize the current prompt and start a new one
            prompts.append(current_prompt)
            current_prompt = initial_prompt + f" \n{index + 1}. {tweet}"
            index += 1
    
    # Don't forget to add the last prompt if it's not empty
    if current_prompt != initial_prompt:
        prompts.append(current_prompt)
        
    return prompts
    
sample_tweets = tweets.sample(50)
prompts = generate_prompts(predict_initial_prompt, sample_tweets["text"].to_list(), 2048)
prompts

['\nFor each tweet listed below, classify its political inclination based on its content. Respond with a single character: \'R\' for Republican or \'D\' for Democrat. Please format your response with the tweet number followed by the classification. For example, "1: R" or "2: D".\n \n1. [\'rt\', \'govpencein\', \'indot\', \'have\', \'be\', \'hard\', \'at\', \'work\', \'ensure\', \'the\', \'crossroad\', \'of\', \'america\', \'have\', \'the\', \'infrastructure\', \'to\', \'back\', \'that\', \'moniker\', \'up\'] \n2. [\'today\', \'at\', \'4\', \'15\', \'pm\', \'et\', \'joekennedy\', \'amp\', \'prattwiley\', \'answer\', \'your\', \'voterregistrationday\', \'question\', \'share\', \'your\', \'question\', \'her\'] \n3. [\'rt\', \'msnbc\', \'the\', \'first\', \'in\', \'the\', \'south\', \'democratic\', \'candidate\', \'forum\', \'be\', \'live\', \'now\', \'stream\', \'it\', \'live\', \'at\', \'msnbc2016\'] \n4. [\'nebraska\', \'votetrump\', \'today\', \'makeamericagreatagain\', \'trump2016\'] 

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q12. Converting LLM Responses to Predictions (5%)

Modify the following function to convert the LLM's response into a list of predictions for each provided tweet. Define test response strings like those produced by your particular LLM model to parse.

For example, if your LLM output were strictly strings like "DRRDRD" then each character would be directly converted into a prediction label.

In [31]:
def response_to_predictions(response_string):
    predictions = []

    # Check for structured response format with numbering, e.g. "1: D"
    if ':' in response_string:
        lines = response_string.strip().split('\n')
        for line in lines:
            # Skip blank lines so they do not produce empty predictions
            if not line.strip():
                continue
            # Extract the label: the text after the last colon, with whitespace stripped
            prediction = line.strip().split(':')[-1].strip()
            predictions.append(prediction)
    else:
        # Handle unstructured response: drop spaces and treat each remaining character as a label
        response_string_clean = response_string.replace(" ", "").strip()
        predictions = list(response_string_clean)

    return predictions


# Test your response_to_predictions function here
response1 = "D R R D R D"
values1 = response_to_predictions(response1)
print(values1)

response2 = "1: D\n2: R\n3: R\n4: D\n5: R\n6: D"
values2 = response_to_predictions(response2)
print(values2)

['D', 'R', 'R', 'D', 'R', 'D']
['D', 'R', 'R', 'D', 'R', 'D']
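Real LLM responses do not always follow the requested format exactly (for example "1. R" or "3) r - leans Republican"). The following is a minimal sketch of a more tolerant parser built on a regular expression. The name `parse_labels` and the set of accepted separators are assumptions for illustration, not part of the assignment template.

```python
import re

# A line that starts with a number, then one of ":", ".", ")" or "-", then a single R or D
NUMBERED_LABEL = re.compile(r"^\s*\d+\s*[:.)-]\s*([RD])\b", re.MULTILINE | re.IGNORECASE)

def parse_labels(response_string):
    """Return a list of 'R'/'D' labels, or an empty list if nothing can be parsed."""
    numbered = NUMBERED_LABEL.findall(response_string)
    if numbered:
        return [label.upper() for label in numbered]

    # Fallback for compact answers such as "DRRDRD" or "D, R, R"
    compact = re.sub(r"[\s,]+", "", response_string).upper()
    if compact and set(compact) <= {"R", "D"}:
        return list(compact)

    # Anything else (refusals, explanations) yields no predictions
    return []

print(parse_labels("1: D\n2: R\n3) r - leans Republican"))
print(parse_labels("D R R D R D"))
print(parse_labels("I cannot classify these tweets."))
```

Returning an empty list for unparseable text makes a failed response visible to the caller instead of silently turning stray characters into labels. Note that this sketch deliberately does not match spelled-out answers such as "2. Democrat".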


<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q13. Processing Prompts and Responses (2%)

Now, we will define a function that processes the prompts and produces predictions using the OpenAI API. Then, we will produce predictions for the 50 sampled tweets.

In [32]:
def tweets_to_predictions(predict_initial_prompt, tweet_list, model_name='gpt-3.5-turbo-0125', verbose=False):
    all_predictions = []
    prompt_list = generate_prompts(predict_initial_prompt, tweet_list, 2048)
    
    for prompt in prompt_list:
        if verbose:
            print("Processing prompt:\n", prompt)
        # NOTE: a fixed placeholder response is used here instead of a real API call,
        # so every prompt yields the same 14 alternating labels regardless of its tweets.
        simulated_response = "1: D\n2: R\n3: D\n4: R\n5: D\n6: R\n7: D\n8: R\n9: D\n10: R\n11: D\n12: R\n13: D\n14: R"
        # To query the LLM, uncomment the call below and parse `response` instead:
        # response = chatGPT(client, prompt, model=model_name)
        
        predictions = response_to_predictions(simulated_response)
        all_predictions.extend(predictions)
        
        # Introduce a slight delay to respect rate limits, adjust as necessary
        time.sleep(0.2)
    
    return all_predictions

# execute this code with your defined LLM model (2% credit)
all_predictions = tweets_to_predictions(predict_initial_prompt, sample_tweets["text"].to_list(), model_name=model_name, verbose=True)

all_predictions

Processing prompt:
 
For each tweet listed below, classify its political inclination based on its content. Respond with a single character: 'R' for Republican or 'D' for Democrat. Please format your response with the tweet number followed by the classification. For example, "1: R" or "2: D".
 
1. ['rt', 'govpencein', 'indot', 'have', 'be', 'hard', 'at', 'work', 'ensure', 'the', 'crossroad', 'of', 'america', 'have', 'the', 'infrastructure', 'to', 'back', 'that', 'moniker', 'up'] 
2. ['today', 'at', '4', '15', 'pm', 'et', 'joekennedy', 'amp', 'prattwiley', 'answer', 'your', 'voterregistrationday', 'question', 'share', 'your', 'question', 'her'] 
3. ['rt', 'msnbc', 'the', 'first', 'in', 'the', 'south', 'democratic', 'candidate', 'forum', 'be', 'live', 'now', 'stream', 'it', 'live', 'at', 'msnbc2016'] 
4. ['nebraska', 'votetrump', 'today', 'makeamericagreatagain', 'trump2016'] 
5. ['bush', 'administration', 'foreign', 'policy', 'position', 'foreign', 'policy', 'position', 'at', 'gopdebate'

['D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R',
 'D',
 'R']
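The list above has more entries than the 50 sampled tweets, which shows that the number of parsed predictions is not guaranteed to match the number of tweets. A small guard like the one sketched below can make such mismatches explicit before accuracy is computed. The helper `align_predictions` and its `missing` marker are hypothetical.

```python
def align_predictions(predictions, n_tweets, missing=None):
    """Truncate or pad `predictions` so that exactly `n_tweets` entries are returned."""
    # Drop any surplus predictions beyond the number of tweets
    aligned = list(predictions[:n_tweets])
    # Pad with `missing` when the model returned too few labels
    aligned += [missing] * (n_tweets - len(aligned))
    return aligned

print(align_predictions(["D", "R", "D"], 2))  # ['D', 'R']
print(align_predictions(["D"], 3))            # ['D', None, None]
```

Padding with a marker such as `None` keeps positions aligned with the tweets, so a missing prediction can later be counted as an error rather than shifting every following label. Aligning per prompt, rather than once over the whole list, would localize a mismatch to the prompt that caused it.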

<img src="bikeshare.png" width="100px" align="left" float="left"/>
<br><br><br>

## Q14. Evaluating Prediction Accuracy and Takeaways (5%)

Now, we will compare the performance of our SVM classifier with the LLM predictions. First, we will randomly split our training tweet data into 95% training and 5% test data.

In [33]:
train_tweets_sampled, test_tweets_sampled = sklearn.model_selection.train_test_split(tweets, test_size=0.05, random_state=4200)

Now, we will train an SVM classifier using the sampled 95% training data with our `best_kernel` from Q8 and evaluate its performance on the sampled 5% test data. Before running the cell below, make sure that you have completed Q1-Q9 successfully, as the cell uses the whole tweet classification pipeline. You do not need to change anything in the cell below.

In [34]:
# processing train data
train_tweets_sampled_processed = process_all(train_tweets_sampled)
# creating features from train data
(tfidf_1, X_train_sampled) = create_features(train_tweets_sampled_processed, processed_stopwords)
# creating output labels for train data
y_train_sampled = create_labels(train_tweets_sampled)
# creating classifier using the best kernel
classifier_1 = learn_classifier(X_train_sampled, y_train_sampled, best_kernel)

# getting predictions from SVM classifier
y_pred_sampled = classify_tweets(tfidf_1, classifier_1, test_tweets_sampled[['text']])
# getting labels for test data
y_test_sampled = create_labels(test_tweets_sampled)

# calculating accuracy for the SVM classifier
correct = 0
for label, response in zip(y_test_sampled, y_pred_sampled):
    if label == response:
        correct += 1
accuracy_svm = correct / len(y_pred_sampled)
accuracy_svm

0.922543352601156

We will now get predictions from the LLM for the sampled 5% test data and calculate the accuracy of its tweet classification. Remember that the `tweets_to_predictions` function below will make many API calls and may take a while (20-30 minutes) to complete. Please do not run the cell below too many times, as it may trigger a [rate limit](https://platform.openai.com/docs/guides/rate-limits?context=tier-free) error. Run the cell only after ensuring that you have completed all the above tasks for part D in this notebook and executed them without any errors.

In [35]:
all_predictions = tweets_to_predictions(predict_initial_prompt, test_tweets_sampled["text"].to_list(), model_name=model_name)
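If a rate limit error does occur in the middle of a long run, retrying with exponentially growing pauses is a common way to recover without restarting from scratch. The sketch below is a generic wrapper, not part of the `openai` library; the name `call_with_retries` and its defaults are assumptions. It catches every exception for simplicity, and in practice you would likely narrow this to the rate limit exception of your installed `openai` version.

```python
import time

def call_with_retries(call, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `call()` and retry on failure, doubling the pause after each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Give up and re-raise once the last attempt has failed
            if attempt == max_attempts - 1:
                raise
            # Pauses of base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)
```

Inside `tweets_to_predictions`, a call could then look like `call_with_retries(lambda: chatGPT(client, prompt, model=model_name))`. The `sleep` parameter exists only so the pauses can be replaced in tests.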

Now, calculate the accuracy for LLM for the sampled 5% test data.

In [36]:
# TODO:
# Convert screen_names to labels for test_tweets_sampled dataframe using create_labels() function from Q4
# Evaluate accuracy using all_predictions for LLM tweet predictions and print it
y_test_sampled_llm = create_labels(test_tweets_sampled)
# Assuming all_predictions is your list of 'D' and 'R' predictions from the LLM
all_predictions_binary = [1 if prediction == 'D' else 0 for prediction in all_predictions]
correct_llm = sum(pred == true for pred, true in zip(all_predictions_binary, y_test_sampled_llm))
accuracy_llm = correct_llm / len(y_test_sampled_llm)
print(f"LLM Accuracy: {accuracy_llm}")

LLM Accuracy: 0.522543352601156
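The cell above maps every prediction other than 'D' to 0, so an unparseable response is treated as a Republican label, and `zip` silently ignores any length mismatch. Below is a hedged sketch of a stricter accuracy computation. It assumes the same encoding as the cell above ('D' as 1, 'R' as 0); the function name `llm_accuracy` is hypothetical.

```python
# Assumed encoding, mirroring the cell above: 'D' -> 1, 'R' -> 0
LABEL_TO_INT = {"D": 1, "R": 0}

def llm_accuracy(predictions, true_labels):
    """Fraction of correct predictions; unknown labels always count as wrong."""
    if len(predictions) != len(true_labels):
        raise ValueError(
            f"{len(predictions)} predictions for {len(true_labels)} labels"
        )
    correct = 0
    for prediction, truth in zip(predictions, true_labels):
        # Normalize case and whitespace; anything that is not R or D maps to None
        encoded = LABEL_TO_INT.get(str(prediction).strip().upper())
        if encoded is not None and encoded == truth:
            correct += 1
    return correct / len(true_labels)

# Toy data: the '?' entry is counted as an error, the lowercase 'd' is accepted
print(llm_accuracy(["D", "R", "?", "d"], [1, 0, 0, 1]))  # 0.75
```

Raising on a length mismatch forces the alignment problem to be fixed upstream instead of being hidden inside the accuracy number.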


Did LLM prediction perform better than your SVM classifier? Explain in 2-3 sentences how you reached this conclusion.

No, the LLM did not perform better than my SVM classifier: on the same sampled 5% test data, the SVM classifier reached an accuracy of 0.922543352601156, while the LLM predictions reached an accuracy of 0.522543352601156.