# LIN 371 Machine Learning for Text Analysis

# Homework 1 - due Monday Jan 29 2024 at 11:59pm

For this homework you will hand in (upload) to Canvas:
- a notebook renamed ``hw1_YourEID.ipynb``

__Before submitting__, please reset your kernel and rerun everything from the beginning (`Kernel` >> `Restart and Run All`) and ensure your code outputs the correct answer. 

A perfect solution for this homework is worth 95 points but will be counted out of 100. If you completed homework 0, you will automatically receive an additional 5 points.  For programming tasks, make sure that your code can run using Python 3.5+. If you cannot complete a problem, include as much pseudocode as possible for partial credit. However, make sure it does not have any output errors. **If there are any output errors, half of the points for that problem will be automatically deducted.**

Collaboration: you are free to discuss the homework assignments with other students and work towards solutions together.  However, all of the code you write must be your own! There is a channel on Slack where you can look for a study group.

Review extension and academic dishonesty policy here: https://jessyli.com/courses/lin371

For typing up your answers to problems 1, 2 and 3, information can be found about Markdown cells for Jupyter Notebooks here: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html


### Please list any collaborators here:
- Rylan Vachon





## Problem 1: the paradox of induction (15 points)

Consider a statement whose truth is unknown. If we see many examples that are compatible with it, we are tempted to view the statement as more probable. Such reasoning is often referred to as _inductive inference_ (in a philosophical, rather than mathematical sense). Consider now the statement that "all cows are white". An equivalent statement is that "everything that is not white is not a cow". We then observe several black panthers. Our observations are clearly compatible with the statement, but do they make the hypothesis "all cows are white" more likely?

To analyze such a situation, we consider a probabilistic model. Let us assume that there are two possible states of the world, which we model as complementary events:

<center> $A$: all cows are white,
    
<center> $A^c$: 50% of all cows are white.

Let $p$ be the prior probability $P(A)$ that all cows are white. We make an observation of a cow or a panther, with probability $q$ and $1-q$, respectively, independent of whether event $A$ occurs or not. Assume that $0<p<1, 0<q<1$, and that all panthers are black.




### (a) Given the event $B=$\{a black panther was observed\}, what is $P(A|B)$? Show your work (5pts)

Panthers are observed with probability $1-q$ regardless of whether $A$ holds, and all panthers are black, so $P(B|A) = P(B|A^c) = 1-q$ and therefore $P(B) = 1-q$.

We need to find $P(A|B)$, which can be written as $\dfrac {P(A \cap B)}{P(B)}$

* $P(A|B) = \dfrac {P(A \cap B)}{P(B)}$

    * $= \dfrac {P(B|A) \cdot P(A)}{P(B)}$

    * $P(B|A) = P(B) = 1-q$
    
    * $P(A) = p$
    
    * $= \dfrac {(1-q) \cdot p}{1-q}$

    * $P(A|B) = p$

Therefore, seeing a black panther does not change the probability that all cows are white. The two events are independent.

### (b) Given the event $C=$\{a white cow was observed\}, what is $P(A|C)$? Show your work (5pts)

We need to find $P(A|C)$ using Bayes' rule.

* $P(A|C) = \dfrac{P(C|A) \cdot P(A)}{P(C)}$

    * A cow is observed with probability $q$. Under $A$ every cow is white, so $P(C|A) = q$; under $A^c$ half of all cows are white, so $P(C|A^c) = 0.5q$.

    * $P(A) = p$ and $P(A^c) = 1-p$

    * $P(C)$ is the probability of observing a white cow. We need to consider both the event $A$ and its complement $A^c$:

        * $P(C) = P(C,A) + P(C,A^c)$

        * $= (P(C|A) \cdot P(A)) + (P(C|A^c) \cdot P(A^c))$

        * $= (q \cdot p) + (0.5q \cdot (1-p))$

        * $= q \cdot (p + 0.5 - 0.5p)$

        * $P(C) = q \cdot (0.5p + 0.5)$

* $P(A|C) = \dfrac{q \cdot p}{q \cdot (0.5p + 0.5)} = \dfrac{p}{0.5p + 0.5} = \dfrac{2p}{1+p}$

The probability $P(A|C)$ is $\dfrac{2p}{1+p}$.
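
Both posteriors in this problem can be checked numerically with exact fractions. The following is a minimal sketch, not part of the assignment: the function name `posteriors` and the sample values $p = 1/3$, $q = 1/4$ are assumptions chosen for illustration, and it takes $P(B|A) = P(B|A^c) = 1-q$, $P(C|A) = q$ and $P(C|A^c) = q/2$ from the problem statement.

```python
from fractions import Fraction

def posteriors(p, q):
    """Return (P(A|B), P(A|C)) for prior p = P(A) and cow-observation probability q."""
    # B: a black panther is observed. Panthers are always black,
    # so P(B|A) = P(B|A^c) = 1 - q.
    p_b = p * (1 - q) + (1 - p) * (1 - q)
    post_b = p * (1 - q) / p_b
    # C: a white cow is observed. P(C|A) = q and P(C|A^c) = q / 2.
    p_c = p * q + (1 - p) * q / 2
    post_c = p * q / p_c
    return post_b, post_c

# hypothetical sample values
post_b, post_c = posteriors(Fraction(1, 3), Fraction(1, 4))
print(post_b)  # 1/3
print(post_c)  # 1/2
```

With these values the panther observation leaves the prior unchanged, while the white-cow observation raises it; $q$ cancels in both cases.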

### (c) Which is larger? Explain the implication. (5pts)



***
## Problem 2: MAP (11 pts)

We have discussed the Bernoulli distribution. Now, if you flip the coin $N$ times, with $H$ heads and $T$ tails, the likelihood of observing a sequence (data) $D$ is:
$$
P(D|\theta) = \theta^H(1-\theta)^T
$$

In the Bayesian framework, we assume that we have some prior knowledge of $\theta$ which can be described with a distribution $P(\theta)$. 
The Beta distribution $\text{Beta}(\alpha, \beta)$ is often used as this prior, which, combined with our observed data $D$, leads to the posterior distribution $P(\theta|D)$. 

But what if we want a single number (an estimate) of our "best" $\theta$? This value will no longer be the same as the maximum likelihood estimate $\hat{\theta}_{MLE}$. Instead, this new "best" parameter, dubbed $\hat{\theta}_{MAP}$, is called the _maximum a posteriori_ (MAP) estimate: $$\hat{\theta}_{MAP}=\text{argmax}_\theta P(\theta|D)$$

### (a) What is $\hat{\theta}_{MAP}$? Express it in terms of $H,T,\alpha,\beta$.

$$\hat{\theta}_{MAP}=\text{argmax}_\theta P(\theta|D)$$

$\text{argmax}_\theta P(\theta|D)$ =
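
One standard route to a closed form is sketched here. It assumes the prior density is $P(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$ and that $H+\alpha>1$ and $T+\beta>1$, so the maximum lies strictly inside $(0,1)$. Multiply the likelihood by the prior, take the log, and set the derivative to zero:

$$
P(\theta|D) \propto P(D|\theta)P(\theta) \propto \theta^{H+\alpha-1}(1-\theta)^{T+\beta-1}
$$

$$
\frac{d}{d\theta}\log P(\theta|D) = \frac{H+\alpha-1}{\theta} - \frac{T+\beta-1}{1-\theta} = 0
$$

$$
\hat{\theta}_{MAP} = \frac{H+\alpha-1}{H+T+\alpha+\beta-2}
$$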


---
## Problem 3: Decision Trees (30 points)

Consider the following set of training examples where we have two features, $X_1$ and $X_2$, and the goal is to predict the target $Y$. Each row indicates the values observed, and how many times that set of values was observed. For example, $(+,T,T)$ was observed 3 times, while $(−,T,T)$ was never observed.

|Y | $X_1$ | $X_2$ | Count|
|-|-|-|-|
|+ | T | T | 3|
|+ | T | F | 4|
|+ | F | T | 4|
|+ | F | F | 1|
|- | T | T | 0|
|- | T | F | 1|
|- | F | T | 3|
|- | F | F | 5 |







### (a) What is the sample entropy $H(Y)$ for this training data (with logarithms base 2), _before_ we start splitting on either feature? (10 pts)

The equation for entropy is $H(Y) = -\sum_{y} P(Y = y)\log_{2}P(Y=y)$

The base entropy for this training data would be:

* $H(Y) = -P(Y=+)\log_{2}P(Y=+) - P(Y=-)\log_{2}P(Y=-)$

    * $= (-\dfrac{12}{21}\cdot \log_{2}\dfrac{12}{21}) - (\dfrac{9}{21}\cdot \log_{2}\dfrac{9}{21})$

    * $= (-0.5714 \cdot -0.8074) - (0.4286 \cdot -1.2224)$

    * $= 0.4613 + 0.5239$

    * $H(Y) = 0.9852$

The sample entropy $H(Y)$ before we start splitting on either feature is 0.985 bits.
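
The entropy arithmetic can be reproduced in a few lines of Python. This is a minimal sketch: the helper name `entropy` is an assumption, and the counts 12 and 9 are the totals of positive and negative rows in the training table.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    # zero counts contribute nothing (0 * log 0 is taken as 0)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# 12 positive and 9 negative examples in the training table
h_y = entropy([12, 9])
print(round(h_y, 3))  # 0.985
```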


### (b)  Should we first split on $X_1$ or $X_2$? Calculate the information gains for each feature. (20 pts)

The formula for conditional entropy is $H(Y|X) = \sum_{x} P(X = x) \cdot H(Y|X = x)$

#### Information gain for $X_1$:
##### Conditional Entropy for when $X_1$ = T

* $H(Y|X_1=T) = (-\dfrac{7}{8}\cdot \log_{2}\dfrac{7}{8}) - (\dfrac{1}{8}\cdot \log_{2}\dfrac{1}{8})$

    * $= (-0.875 \cdot -0.1926) - (0.125 \cdot -3)$

    * $= 0.1686 + 0.375$

    * $H(Y|X_1=T) = 0.5436$

    * Weighted contribution to $H(Y|X_1)$: $P(X_1=T) \cdot H(Y|X_1=T)$

        * $= \dfrac{8}{21} \cdot 0.5436$

        * $= 0.2071$

##### Conditional Entropy for when $X_1$ = F

* $H(Y|X_1=F) = (-\dfrac{5}{13}\cdot \log_{2}\dfrac{5}{13}) - (\dfrac{8}{13}\cdot \log_{2}\dfrac{8}{13})$

    * $= (-0.3846 \cdot -1.3785) - (0.6154 \cdot -0.7004)$

    * $= 0.5302 + 0.4310$

    * $H(Y|X_1=F) = 0.9612$

    * Weighted contribution to $H(Y|X_1)$: $P(X_1=F) \cdot H(Y|X_1=F)$

        * $= \dfrac{13}{21} \cdot 0.9612$

        * $= 0.5951$

The formula for information gain is $H(Y) - H(Y|X)$

*   $0.9852 - (0.2071 + 0.5951) = 0.1830$

The information gain when we split on $X_1$ is 0.18 bits.

___

#### Information gain for $X_2$:
##### Conditional Entropy for when $X_2$ = T

* $H(Y|X_2=T) = (-\dfrac{7}{10}\cdot \log_{2}\dfrac{7}{10}) - (\dfrac{3}{10}\cdot \log_{2}\dfrac{3}{10})$

    * $= (-0.7 \cdot -0.5146) - (0.3 \cdot -1.7370)$

    * $= 0.3602 + 0.5211$

    * $H(Y|X_2=T) = 0.8813$

    * Weighted contribution to $H(Y|X_2)$: $P(X_2=T) \cdot H(Y|X_2=T)$

        * $= \dfrac{10}{21} \cdot 0.8813$

        * $= 0.4197$

##### Conditional Entropy for when $X_2$ = F

* $H(Y|X_2=F) = (-\dfrac{5}{11}\cdot \log_{2}\dfrac{5}{11}) - (\dfrac{6}{11}\cdot \log_{2}\dfrac{6}{11})$
    
    * $= (-\dfrac{5}{11}\cdot-1.1375)-(\dfrac{6}{11}\cdot-0.8745)$

    * $= 0.5170 + 0.4770$

    * $H(Y|X_2=F) = 0.9940$

    * Weighted contribution to $H(Y|X_2)$: $P(X_2=F) \cdot H(Y|X_2=F)$

        * $= \dfrac{11}{21} \cdot 0.9940$

        * $= 0.5207$

The formula for information gain is $H(Y) - H(Y|X)$

*   $0.9852 - (0.4197 + 0.5207) = 0.0448$

The information gain when we split on $X_2$ is 0.04 bits.

### Conclusion

The information gain for $X_1$ is 0.18 bits and for $X_2$ is 0.04 bits, so we should split on $X_1$ first.
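
The comparison between the two features can also be computed directly from the count table. The sketch below is illustrative only: the names `table`, `entropy` and `information_gain` are assumptions, and the feature is selected by its position in the `(Y, X1, X2)` key.

```python
import math

# (Y, X1, X2) -> count, copied from the training table
table = {
    ("+", "T", "T"): 3, ("+", "T", "F"): 4, ("+", "F", "T"): 4, ("+", "F", "F"): 1,
    ("-", "T", "T"): 0, ("-", "T", "F"): 1, ("-", "F", "T"): 3, ("-", "F", "F"): 5,
}

def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(table, feature_index):
    """H(Y) - H(Y|X) for the feature at position feature_index (1 = X1, 2 = X2)."""
    total = sum(table.values())
    y_counts = {}   # label -> count
    by_value = {}   # feature value -> {label: count}
    for key, n in table.items():
        y, x = key[0], key[feature_index]
        y_counts[y] = y_counts.get(y, 0) + n
        by_value.setdefault(x, {})
        by_value[x][y] = by_value[x].get(y, 0) + n
    # weighted sum of the per-value entropies
    h_cond = sum(
        sum(labels.values()) / total * entropy(list(labels.values()))
        for labels in by_value.values()
    )
    return entropy(list(y_counts.values())) - h_cond

ig_x1 = information_gain(table, 1)
ig_x2 = information_gain(table, 2)
print(round(ig_x1, 3), round(ig_x2, 3))  # 0.183 0.045
```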

---
## Problem 4:  log odds ratios (44 points)
This exercise is an exploratory analysis of the Sentiment140 dataset.

In this problem, we will analyze how often a word tends to appear with a positive sentiment vs. a negative one. The metric we are going to use is the **log odds ratio**, which compares the conditional probability of a word occurring in one type of sentence, say, positive ($P(word|pos)$), with that of the word occurring in another type of sentence, say, negative ($P(word|neg)$):
$$log\_odds\_ratio(word, pos) = \log\frac{P(word|pos)}{P(word|neg)}$$
The higher the $log\_odds\_ratio$, the more likely the word is associated with positive sentences.
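
As a quick illustration of the sign convention, here is a minimal sketch with made-up probabilities. The two values below are hypothetical and are not taken from the dataset.

```python
import math

# hypothetical toy probabilities, not taken from the dataset
p_word_given_pos = 0.004
p_word_given_neg = 0.001

# the word is four times as likely in positive text, so the ratio is positive
log_odds_pos = math.log(p_word_given_pos / p_word_given_neg)
# swapping the roles of the two classes flips the sign
log_odds_neg = math.log(p_word_given_neg / p_word_given_pos)

print(round(log_odds_pos, 3))  # 1.386
print(round(log_odds_neg, 3))  # -1.386
```

A value of 0 would mean the word is equally likely under both classes.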

We will use the Sentiment140 dataset. Sentiment140 combines 1.6 million tweets collected via the Twitter API with most of the emoticons removed. Each tweet is annotated with polarity: positive (4), negative (0) or neutral (2). _We will  **not** consider neutral tweets in this problem_. You do not have to check the original paper that proposed this dataset, but if you are curious, here is the link: [https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf](https://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf).

Download from Canvas the file ``sentiment140_sample1.csv`` ---a 10K sample from the training set of Sentiment140---and put it in the  **same directory** (folder) as your python script or notebook file. As a reminder, the file has six fields: polarity, tweet ID, date, query, username and the text of the tweet. We will only use polarity and tweet text in this assignment.

In the following exercises, we have provided several expected inputs and outputs of the functions that you will implement. Treat these as test cases for your code; if you get numbers very far off from what is listed here with the same input, you have bugs to crush.

In [1]:
# Read in the data
import pandas
sentiment_data = pandas.read_csv("sentiment140_sample1.csv", header = None, encoding = "ISO-8859-1")
sentiment_data.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|-|-|-|-|-|-|-|
| 0 | 0 | 1752285545 | Sat May 09 21:30:51 PDT 2009 | NO_QUERY | Doomed_Vampire | Doesn't know what to eat! What would you eat ... |
| 1 | 0 | 1468206560 | Tue Apr 07 00:17:43 PDT 2009 | NO_QUERY | rynequin | Doesn't know why, but is feeling very down. An... |
| 2 | 0 | 2179666845 | Mon Jun 15 09:25:05 PDT 2009 | NO_QUERY | Deeluvly | @OfficialTG3 you know I'm leaving Texas, right? |
| 3 | 0 | 2219598618 | Thu Jun 18 00:54:58 PDT 2009 | NO_QUERY | katyshepherdx | I really hate her. |
| 4 | 0 | 1822816157 | Sat May 16 20:27:14 PDT 2009 | NO_QUERY | jennyconfetti | No one's replying to my texts |


In [2]:
sentiment_data.describe()

| | 0 | 1 |
|-|-|-|
| count | 10000.0 | 10000.0 |
| mean | 1.9844 | 2001648000.0 |
| std | 2.000039 | 192134800.0 |
| min | 0.0 | 1467817000.0 |
| 25% | 0.0 | 1957661000.0 |
| 50% | 0.0 | 2002835000.0 |
| 75% | 4.0 | 2177586000.0 |
| max | 4.0 | 2329052000.0 |


### (a) Frequency counts  (11 points)
First, let's create dictionaries that record the count of each word in positive tweets, as well as the count of each word in negative tweets. Here, ``counts["pos"]`` will contain key-value pairs of a word and its number of appearances in positive tweets, and ``counts["neg"]`` will contain key-value pairs of a word and its number of appearances in negative tweets.

To parse the tweets, we will use NLTK's ``word_tokenize()`` function. As an example, the following tokenizes a sentence into a list of words:

In [3]:
import nltk
nltk.download('punkt') #you only have to do this once per environment

from nltk import word_tokenize
word_tokenize("This is a sentence.")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/eloraghespie/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['This', 'is', 'a', 'sentence', '.']

Lower-casing all words gives cleaner counts. For example, consider the two sentences: "Apples are delicious. John loves apples." If we do not lower-case each word, "Apples" and "apples" will be counted as two different words. In Python, you can lower-case a word by calling ``lower()``:

In [4]:
print("Apples" == "apples")
print("Apples becomes", "Apples".lower())
print("Apples".lower() == "apples")

False
Apples becomes apples
True


We will only consider words and not symbols or numbers. To test whether a token is a word, that is, whether it consists only of alphabetic characters, we can use ``isalpha()``:

In [5]:
print("Apples".isalpha())
print("Apples123".isalpha())

True
False


**Complete the code below**

In [10]:
def get_counts(data):
    """ 
    counts the number of times a word appears in negative or positive tweets
    
    Parameters:
    data: Pandas dataframe of tweets
    
    Returns:
    counts: Dictionary of counts, which includes the dictionaries 'pos' and 'neg'
    
    """
    
    counts = {"pos":{}, "neg":{}}
    
    # map polarity labels to dictionary keys; neutral tweets (2) are skipped
    labels = {0: "neg", 4: "pos"}

    # loop through the rows of the dataframe passed in (not the global variable)
    for idx, row in data.iterrows():
        polarity = labels.get(row[0])
        if polarity is None:
            continue

        # lowercase and tokenize each tweet
        tokens = word_tokenize(row[5].lower())

        for word in tokens:
            # only count purely alphabetic tokens
            if word.isalpha():
                # add to dictionary, starting from 0 for unseen words
                counts[polarity][word] = counts[polarity].get(word, 0) + 1

    
    return counts

    
# Do not change
counts = get_counts(sentiment_data)

print(counts["pos"]["happy"]) # should print 115
print(counts["neg"]["happy"]) # should print 31
print(counts["pos"]["hate"]) # should print 17
print(counts["neg"]["hate"]) # should print 106


115
31
17
106
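
The same counting pattern can be written more compactly with `collections.Counter`. This is a sketch under stated assumptions, not the assignment's required solution: the mini-corpus `tweets` is hypothetical, and `str.split` stands in for `word_tokenize` so that the example runs without NLTK.

```python
from collections import Counter

# hypothetical mini-corpus of (polarity, text) pairs; 0 = negative, 4 = positive
tweets = [(4, "Happy happy day!"), (0, "I hate Mondays"), (4, "so happy 4 u")]
labels = {0: "neg", 4: "pos"}

counts = {"pos": Counter(), "neg": Counter()}
for polarity, text in tweets:
    # str.split stands in for word_tokenize so the sketch has no NLTK dependency
    tokens = [t for t in text.lower().split() if t.isalpha()]
    counts[labels[polarity]].update(tokens)

print(counts["pos"]["happy"])  # 3
print(counts["neg"]["happy"])  # 0
```

A `Counter` returns 0 for missing keys, which removes the need for an explicit membership check before incrementing.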


### (b) Calculating $P(\text{word}|\text{polarity})$ (11 points)

Create a function ``get_word_prob(counts, word, polarity)``, where ``counts`` is a dictionary like in the previous task, ``word`` is the word for which $P(word|polarity)$ will be calculated, and ``polarity`` is either ``pos`` or ``neg``. The function should return $P(word|polarity)$. If ``counts[polarity]`` does not contain ``word``, then return 0.

Note that you should NOT need to use the variable ``data`` here, and only rely on the three arguments of the function: ``counts, word, polarity``.


In [7]:
def get_word_prob(counts, word, polarity):
    """ 
    calculates the probability of a word given a polarity 
    
    Parameters:
    counts (dict): the dictionaries 'pos' and 'neg' which count word occurrences
    word (str): the word you want to get the probability for
    polarity (str): either 'pos' or 'neg'
    
    Returns:
    probability (float):  the probability of a word given a polarity 
    
    """
    # Your code goes here

    # number of times the word appears in the given polarity dict
    # (0 if it is not there at all)
    count = counts[polarity].get(word, 0)

    # total number of word occurrences within the given polarity dict
    total = sum(counts[polarity].values())

    # guard against an empty dictionary to avoid dividing by zero
    probability = count / total if total else 0.0
    
    return probability # P(word|polarity)

#Do not change
print(get_word_prob(counts, "great", "pos")) # should be ~0.00239
print(get_word_prob(counts, "glad", "neg")) # should be ~0.000255
print(get_word_prob(counts, "wugs", "neg")) # should be 0


0.002390373899701203
0.00025490313680801295
0.0


### (c) Calculate the log odds ratio of a word  (11 points)


Using the above function, we can calculate $P(word|pos)$ and $P(word|neg)$ given a word, so we are ready to calculate the log odds ratio of that word as well. Create a function ``log_odds_ratio(counts, word, polarity)``, where the arguments are of the same type/format as in the previous problem. The function should return $log\_odds\_ratio(word, polarity)$:

$$ log\_odds\_ratio(word, polarity) = \log\frac{P(word|polarity)}{P(word|opposite\_polarity)} $$

If the denominator is zero, return a very large number (please return 10000). Again you should NOT need to use the variable ``data`` here, and only rely on the three arguments of the function: ``counts``, ``word``, and ``polarity``.

In [8]:
def log_odds_ratio(counts, word, polarity):
    """ 
    This function returns the log odds ratio of a term (see previous cell)
    
    Parameters:
    counts (dict): the dictionaries 'pos' and 'neg' which count word occurrences
    word (str): the word you want to get the log odds ratio for
    polarity (str): either 'pos' or 'neg'
    
    Returns:
    log_odds_ratio (float): log( P(word|polarity) / P(word|opposite_polarity) )
    
    """
    # Your code goes here

    import math

    # identify what the opposite polarity is
    opposite_polarity = 'neg' if polarity == 'pos' else 'pos'

    p_word = get_word_prob(counts, word, polarity)
    p_opposite = get_word_prob(counts, word, opposite_polarity)

    # zero denominator: return the very large number the assignment asks for
    if p_opposite == 0:
        return 10000

    # zero numerator: math.log(0) raises ValueError, and the assignment does not
    # specify this case; returning -infinity here is an assumption
    if p_word == 0:
        return float('-inf')

    # calculate the log odds ratio
    return math.log(p_word / p_opposite)


# Do not change
print(log_odds_ratio(counts, "great", "pos")) # should be ~1.287
print(log_odds_ratio(counts, "the", "neg")) #  should be ~-0.0779
print(log_odds_ratio(counts, "wug", "neg")) # should be a very large number

1.2873451688770043
-0.0779544942827149
10000


### (d) Sorting log odds ratios (11 points)

Now that we can calculate log odds ratios for individual words, we can sort words according to their association with a polarity class, say, positive. Create a function ``sort_pos_words(data)`` that takes in the entire dataframe as an argument and returns a sorted list of ``(word, log odds ratio)`` tuples for the positive sentiment class.

If you implement this without filtering out any words, you will notice that there are many cases where the conditional probability of the denominator is 0, leading to the very large number you specified in the ``log_odds_ratio()`` function. This is because most words appear only once in the dataset. One way to mitigate this issue is to consider only words that appeared at least $x$ times in the dataset; here, let's only include words that appeared more than 10 times in the dataset, regardless of the polarity of the tweet (positive or negative).

Use your function to print out the top 10 most positive words (note: we only consider words that appear in the positive sentiment class, so you only need to sort those words and take the top 10 and bottom 10):

`` [('followfriday', 10000), ('ff', 10000), ('proud', 10000), ('yummy', 3.1187445986382114), ('congrats', 3.0699544344687792), ('moon', 2.713279490530047), ('exciting', 2.639171518376325), ('gorgeous', 2.5591288107027883), ('welcome', 2.5165691962839927), ('prom', 2.4721174337131586)]``
       
 and the top 10 most negative:
 
`` [('bummed', -2.410684488873212), ('scared', -2.410684488873212), ('stomach', -2.4907271965467483), ('ugh', -2.5648351687004705), ('happened', -2.8702168182516523), ('worse', -2.9215101126392025), ('throat', -2.9703002768086346), ('died', -3.144653663953412), ('sad', -3.5203466137279067), ('hurts', -3.589339485214858)]``

In [9]:
def sort_pos_words(data):
    """
    takes in a pandas dataframe and returns the words of the positive class sorted from most positive to most negative
    
    Parameters:
    data (pandas.DataFrame): the tweets in a dataframe
    
    Return:
    sorted_list (list): a sorted list of (word, log odds ratio) tuples for the positive sentiment class
    
    """
    # Your code goes here
    
    sorted_list = []
    counts = get_counts(data)

    # we are only concerned with words that are in pos
    for word in counts['pos']:
        # keep the word if it occurs more than 10 times across the entire dataset
        if counts['pos'][word] + counts['neg'].get(word, 0) > 10:
            # take the log odds ratio and put it in the list
            sorted_list.append((word, log_odds_ratio(counts, word, 'pos')))
    
    # sort the list based on log odds ratio
    sorted_list = sorted(sorted_list, key=lambda x: x[1], reverse=True)
    
    return sorted_list
    
# Do not change
lst = sort_pos_words(sentiment_data)
print("Top 10 most positive \n", lst[:10]) # see previous cell for what this should print
print("Top 10 most negative \n", lst[-10:])    

Top 10 most positive 
 [('followfriday', 10000), ('ff', 10000), ('proud', 10000), ('yummy', 3.1188449667545735), ('congrats', 3.0700548025851417), ('moon', 2.7133798586464093), ('exciting', 2.639271886492687), ('gorgeous', 2.559229178819151), ('welcome', 2.5166695644003547), ('prom', 2.472217801829521)]
Top 10 most negative 
 [('bummed', -2.41058412075685), ('scared', -2.41058412075685), ('stomach', -2.4906268284303863), ('ugh', -2.564734800584108), ('happened', -2.87011645013529), ('worse', -2.9214097445228404), ('throat', -2.9701999086922726), ('died', -3.14455329583705), ('sad', -3.5202462456115446), ('hurts', -3.589239117098496)]
