# Worksheet for Week 03: Text Classification & Sentiment Analysis + Logistic Regression

## Background

In this worksheet, we will follow the material covered either in [Chapter 4 of the JM book](https://web.stanford.edu/~jurafsky/slp3/4.pdf) or in the [Text Classification and Sentiment Analysis Playlist](https://www.youtube.com/playlist?list=PLaZQkZp6WhWxU3kA6wV0nb5dY1SXDEKWH). For the questions about logistic regression, you can refer to [Chapter 5 of the JM book](https://web.stanford.edu/~jurafsky/slp3/5.pdf).

Please keep the provided files in your worksheet folder to be able to run the given scripts and see the text examples without any errors.


In [None]:
from os import path
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cd = path.abspath('/content/drive/MyDrive/studies/Potsdam/ANLP')

In [None]:
!ls '/content/drive/MyDrive/studies/Potsdam/ANLP'

 01-anlp20-worksheet-ryazanskaya.ipynb	'Galina Ryazanskaya.gslides'
 02-anlp20-worksheet-ryazanskaya.ipynb	 movie_reviews.json
 03-anlp20-worksheet-ryazanskaya.ipynb	 turkish_news_examples.json


### Exercises

**[E1] You may have noticed that our classification models require a significant number of training examples in order to converge to the underlying probability distribution in the dataset. This brings about the necessity and importance of annotated data: Models require more and more annotated examples each day.**

**In this exercise, you'll have the chance to annotate your own small dataset (which is basically taken from [this Kaggle competition](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/overview) and represented as a json file that includes only 30 examples). Please answer the questions [E1a] and [E1b], after loading the dataset by using the following script:**

In [None]:
import json 
import pandas as pd

pd.set_option('display.max_colwidth', None)

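# Open the JSON file; `with` closes it again after loading
with open(path.join(cd, 'movie_reviews.json')) as f:
    # Parse the JSON object into a dictionary
    reviews = json.load(f)

# Convert the list of reviews to a dataframe
movie_reviews_df = pd.DataFrame(reviews['movie_reviews'])

# Styler.hide_index() was removed in pandas 2.0; hide(axis="index") replaces it
movie_reviews_df.style.hide(axis="index")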

review_id,review,label
r1,"A series of escapades demonstrating the adage that what is good for the goose is also good for the gander, some of which occasionally amuses but none of which amounts to much of a story.",
r2,"This quiet, introspective and entertaining independent is worth seeking",
r3,"Even fans of Ismail Merchant's work , I suspect, would have a hard time sitting through this one.",
r4,"Cattaneo should have followed the runaway success of his first film, The Full Monty, with something different.",
r5,"A positively thrilling combination of ethnography and all the intrigue, betrayal, deceit and murder of a Shakespearean tragedy or a juicy soap opera.",
r6,Aggressive self-glorification and a manipulative whitewash.,
r7,A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis.,
r8,"Narratively, Trouble Every Day is a plodding mess.",
r9,"The Importance of Being Earnest, so thick with wit it plays like a reading from Bartlett's Familiar Quotations.",
r10,"There's little to recommend Snow Dogs, unless one considers cliched dialogue and perverse escapism a source of high hilarity.",


**[E1a] Try to annotate each movie review example in the dataframe using only two class labels: 'Positive' or 'Negative'. Write down your annotations as a list of pairs (review ID, numerical encoding of label), so that (1,0) might mean "example 1 is annotated with label 0". You can select 0 as the negative and 1 as the positive label for convenience. Then compare your annotations with those of your peers. Can you think of a way of quantifying similarities and differences between your annotations?**

When more than two sets of labels have to be compared, several measures are available; the simplest is the agreement fraction, i.e. the share of items on which the annotators chose the same label.

To quantify how unambiguously positive or negative a review is, one can count how many annotators labelled it positive and divide that by the number of annotators. To derive a gold standard from these fractions, one might apply a threshold, such as 0.5 (a majority vote) or the mean agreement.

To measure the agreement between the annotators themselves, one might use Cohen's kappa, a statistic for inter-rater agreement between two annotators that corrects for chance agreement, or Fleiss' kappa for more than two annotators.
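The measures above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are my own, and the two peer annotation lists are made up for the example (they are not real peer data).

```python
from collections import Counter


def positive_fraction(annotations):
    """Per-item share of annotators who chose label 1.

    `annotations` holds one list of 0/1 labels per annotator.
    """
    n_annotators = len(annotations)
    return [sum(item) / n_annotators for item in zip(*annotations)]


def observed_agreement(a, b):
    """Fraction of items on which two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: both annotators pick the same label independently.
    p_e = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if p_e == 1:  # both annotators used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations for the first ten reviews (illustration only).
mine = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]
peer_1 = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
peer_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]

fractions = positive_fraction([mine, peer_1, peer_2])
gold = [int(f > 0.5) for f in fractions]  # majority vote as gold standard

print(fractions)
print(gold)
print(observed_agreement(mine, peer_1), cohen_kappa(mine, peer_1))
```

Kappa is lower than the raw agreement fraction whenever some of the agreement could have arisen by chance, which is why it is usually the preferred figure to report.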

In [None]:
movie_reviews_df['label'] = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
# Styler.hide_index() was removed in pandas 2.0; hide(axis="index") replaces it
movie_reviews_df.style.hide(axis="index")

review_id,review,label
r1,"A series of escapades demonstrating the adage that what is good for the goose is also good for the gander, some of which occasionally amuses but none of which amounts to much of a story.",0
r2,"This quiet, introspective and entertaining independent is worth seeking",1
r3,"Even fans of Ismail Merchant's work , I suspect, would have a hard time sitting through this one.",0
r4,"Cattaneo should have followed the runaway success of his first film, The Full Monty, with something different.",0
r5,"A positively thrilling combination of ethnography and all the intrigue, betrayal, deceit and murder of a Shakespearean tragedy or a juicy soap opera.",1
r6,Aggressive self-glorification and a manipulative whitewash.,0
r7,A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis.,1
r8,"Narratively, Trouble Every Day is a plodding mess.",0
r9,"The Importance of Being Earnest, so thick with wit it plays like a reading from Bartlett's Familiar Quotations.",1
r10,"There's little to recommend Snow Dogs, unless one considers cliched dialogue and perverse escapism a source of high hilarity.",0


**[E1b] Imagine doing the task at [E1a] once more, this time with five class labels: 'Positive', 'Somewhat Positive', 'Neutral', 'Somewhat Negative' or 'Negative'. How do you think the differences with respect to your colleagues would change? How would you set guidelines for a more robust annotation process, also considering arbitration rules for examples that receive many different candidate annotations?**

The more labels there are, the more annotators tend to disagree, whatever the quality of the guidelines. For this particular task, I realized that I marked as 'Neutral' only those reviews that contain factual information with no implicit emotional assessment. A review that contains facts and a neutral assessment but no positive emotion can be considered 'Somewhat Negative', since it implies that there was nothing notable about the movie.

In [None]:
movie_reviews_df['label_multiple'] = ['Neutral', 'Positive', 'Negative', 'Negative', 'Positive', 'Negative', 'Positive', 
                                      'Negative', 'Somewhat Positive', 'Somewhat Negative', 'Neutral', 'Positive', 'Positive', 
                                      'Positive', 'Positive', 'Somewhat Positive', 'Somewhat Positive', 'Negative', 'Somewhat Negative', 
                                      'Positive', 'Neutral', 'Negative', 'Somewhat Negative', 'Somewhat Negative', 'Positive', 'Negative', 
                                      'Somewhat Negative', 'Positive', 'Somewhat Positive', 'Somewhat Negative']
# Styler.hide_index() was removed in pandas 2.0; hide(axis="index") replaces it
movie_reviews_df.style.hide(axis="index")

review_id,review,label,label_multiple
r1,"A series of escapades demonstrating the adage that what is good for the goose is also good for the gander, some of which occasionally amuses but none of which amounts to much of a story.",0,Neutral
r2,"This quiet, introspective and entertaining independent is worth seeking",1,Positive
r3,"Even fans of Ismail Merchant's work , I suspect, would have a hard time sitting through this one.",0,Negative
r4,"Cattaneo should have followed the runaway success of his first film, The Full Monty, with something different.",0,Negative
r5,"A positively thrilling combination of ethnography and all the intrigue, betrayal, deceit and murder of a Shakespearean tragedy or a juicy soap opera.",1,Positive
r6,Aggressive self-glorification and a manipulative whitewash.,0,Negative
r7,A comedy-drama of nearly epic proportions rooted in a sincere performance by the title character undergoing midlife crisis.,1,Positive
r8,"Narratively, Trouble Every Day is a plodding mess.",0,Negative
r9,"The Importance of Being Earnest, so thick with wit it plays like a reading from Bartlett's Familiar Quotations.",1,Somewhat Positive
r10,"There's little to recommend Snow Dogs, unless one considers cliched dialogue and perverse escapism a source of high hilarity.",0,Somewhat Negative


**[E2] In this exercise, you'll be provided with 4 different Turkish newspaper texts. These news items belong to 4 different class labels: 'technology', 'sport', 'art and entertainment' and 'economy'.**<br>

**Assuming that none of you speak Turkish, how would you assign a unique label to each of these texts given in the dataframe below? Run the script to see the dataframe and think about the appropriate label for each text.**



In [None]:
import json 
import pandas as pd

pd.set_option('display.max_colwidth', None)

# Open the JSON file; `with` closes it again after loading
with open(path.join(cd, 'turkish_news_examples.json')) as f:
    # Parse the JSON object into a dictionary
    news = json.load(f)

# Convert the list of news texts to a dataframe
turkish_news_df = pd.DataFrame(news['turkish_news'])

# My label guess for each of the four texts
turkish_news_df.at[0, 'label'] = "economy"
turkish_news_df.at[1, 'label'] = "sport"
turkish_news_df.at[2, 'label'] = "technology"
turkish_news_df.at[3, 'label'] = "art and entertainment"

# Styler.hide_index() was removed in pandas 2.0; hide(axis="index") replaces it
turkish_news_df.style.hide(axis="index")

news_id,review,label
n1,"Fitch'in kararına yurtdışından tepkiler: Fitch'in Türkiye'nin kredi derecelendirme notunu yükseltmesine yurt dışındaki ekonomistlerden de olumlu tepkiler geldi. UBS gelişmekte olan piyasalar strateji uzmanı Manik Narain, 'Kredi piyasaları Türkiye ile bir süredir yatırım yapılabilir notu almış gibi ticaret yapıyordu. Bugünden itibaren reel bir yatırım akışı başlayacak ve kredi masrafları düşecek' dedi. Standart Bank gelişmekte olan piyasalar araştırma müdürü Timothy Ash ise 'Yatırım yapılabilir notu Türkiye'ye uzun zaman önce verilmeliydi' diye konuşurken ülkeye dayatılan dış finansal risklerin uzun süredir abartıldığını belirtti. Ash, bu durumun Türkiye'ye güveni arttıracağını, Moody ve S&P'nin de Fitch'i takip etmesini beklediğini sözlerine ekledi.",economy
n2,"Herkes Burak'a Murat Burak'a. Büyükşehir maçında en çok pas '43'le Burak Yılmaz'a atıldı. G.Saray'daki en iyi futbolunu oynayan Kral da kaleye sırtı dönük aldığı pasları, Umut'a asiste çevirdi. G.Saray Teknik Direktörü Fatih Terim, Cluj sınavının provasını yaptığı Büyükşehir maçında taktik zekasından örnekler sergiledi. Fatih Hoca, '3 pasta gol' atma dersi verdi. Cim Bom'un 10 gol girişiminde ortalama pas sayısı 3.20 oldu. 3 pasta kaleye gitme taktiğinin odak noktasındaki isimse Burak Yılmaz'dı.",sport
n3,"Bayramda siber suç kurbanı olmamanın 7 yolu: Dünyanın en büyük antivirüs yazılım kuruluşlarından ESET, Türkiye'de Kurban Bayramı döneminde seyahatlerin ve kişisel bilgisayar kullanımının artması nedeniyle kullanıcıların siber suç kurbanı olmaması için uyarılarda bulundu. 'USB gibi taşınabilir hafıza kartlarının böyle dönemlerde paylaşımı artıyor' diyen ESET Türkiye Genel Müdür Yardımcısı Alev Akkoyunlu, 'Kontrolsüz kullanılan bellekler günümüzün en büyük virüs bulaştırıcıları' açıklamasını yaptı. Akkoyunlu, bu dönemde internette bayram armağanı olarak ulaşan 'imkansız' teklifler ve mesajlar konusunda da uyanık olmaya çağırarak bayramda güvende kalmak için 7 öneride bulundu.",technology
n4,"Stok, 17 Ekim'de Sangria Live'da Bağdat Caddesi’nin tek rock barı olan Sangria, sizi genç ve enerjik bir grupla eğlenceye davet ediyor. 17 Ekim Cumartesi akşamı Stok en yeni ve eğlenceli repertuvarı ile sahnede olacak. Bir yerli içki dahil 25 TL olan konser hafta ortası keyfinizi yerine getirecek. 13 yıldır rock severlerin Bağdat Caddesi’ndeki favori mekanı olan Sangria Live’da konserler son hızıyla devam ediyor.",art and entertainment


**How do you make your predictions? Do you look at any particular words to decide on the actual label? How does the way that you solved this task relate to how Naive Bayes does classification?**

I looked at individual words that resemble English words and seem related to the given topics. "Standart Bank" in n1 points to economy. "gol", which is repeated in n2, might correspond to "goal" and therefore to sport. "USB" in n3 is a good indication that the text is tech-related. The remaining text, n4, would then be about art and entertainment.

The way I solved it is similar to NB in that some words act as strong class predictors (P(USB | technology) is much higher than P(USB | c) for the other classes c), and I use those words to decide on the class while disregarding all the unfamiliar words, much as NB skips words that are not in its training vocabulary.
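The cue-word strategy can be mimicked with a tiny Naive Bayes classifier. The sketch below uses add-one smoothing and skips out-of-vocabulary words; the three training "documents" are invented for illustration (they are not part of the worksheet files), and the function names are my own.

```python
import math
from collections import Counter

# Hypothetical toy training data (illustration only).
train = [
    ("technology", "usb virus internet antivirus"),
    ("sport", "gol pas futbol gol"),
    ("economy", "bank kredi ekonomist bank"),
]


def train_nb(docs):
    """Estimate log priors and add-one-smoothed log likelihoods per class."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in class_counts}
    for label, text in docs:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_like = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)
        log_like[c] = {w: math.log((counts[w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab


def classify(text, log_prior, log_like, vocab):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped entirely.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in text.split() if w in vocab
        )
    return max(scores, key=scores.get)


model = train_nb(train)
print(classify("bilinmeyen kelime usb bellek", *model))
```

Only "usb" is known to the model in the example sentence, so the decision rests on that single cue, just as it did when reading the Turkish texts without knowing the language.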

**[E3] If the document is represented by a certain type of feature, there is a direct relationship between a Naive Bayes classifier and class conditional unigram models. How? Did you use such a relationship while solving [E2]?**<br>

**In other words, you need to consider a given text document represented by a certain type of feature (e.g. words themselves, parts of speech of words, pronunciations etc.), and our text classification model works with the mere counts of each type in the feature vocabulary. Would such a model resemble what we have experienced so far with unigram models?**

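Yes. If the document is represented as a bag of words, the NB likelihood P(d | c) is the product of the per-word probabilities P(w | c), which is exactly the probability that a unigram language model trained on class c assigns to the document; NB then picks the class whose unigram model, weighted by the class prior, scores the document highest. I did use this relationship: words such as "aggressive" or "manipulative" are expected in the negative class, while others are not indicative of either class (stopwords, for example, seem about equally probable in positive and negative reviews). However, for me the vocabulary comes not so much from a training set (which we did not have here) as from my experience with English. This can still be regarded as a Bayesian system, although a more complex one, since I take into account more than a training corpus and word co-occurrence: also syntax, semantics and, importantly, pragmatic context.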
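The link between NB and class-conditional unigram models can be written out explicitly. This is a sketch in the notation of the JM chapter; the symbols $d = w_1 \dots w_n$ for the document and $c$ for a class are my assumption, not the worksheet's.

$$
\hat{c} = \arg\max_{c} \; P(c) \, P(d \mid c)
        = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

$$
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
$$

The product $\prod_{i} P(w_i \mid c)$ is the probability that a unigram language model estimated on the documents of class $c$ (here with add-one smoothing over the vocabulary $V$) assigns to the document, so NB effectively runs one unigram model per class and compares their scores after weighting by the prior $P(c)$.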


**[E4] What's the problem with correlated features for NB, and how may those occur in natural language?** <br> 

Hint: If you are not sure how correlated features affect the performance of NB classifiers, you can check out this [link](https://machinelearningmastery.com/better-naive-bayes/) to get some ideas about how to design an NB algorithm.

NB assumes that features are conditionally independent given the class. Correlated features violate this assumption: their evidence is counted more than once, which skews the probability distribution towards the class that both features indicate. A simple example might be classifying news about Asian countries (as in the textbook) with NB on a bag of words. If "Hong Kong" is treated as two words, the two tokens always occur together and raise the probability of the "Hong Kong" class twice without adding any information, compared with indicative names that consist of one word only.

NB works more reliably if one removes or down-weights the correlated features before training.
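The effect of counting correlated tokens separately can be shown with a small calculation. This is a minimal sketch for a two-class problem; the likelihood ratio of 3 is a made-up number and `nb_posterior` is a hypothetical helper, not a library function.

```python
def nb_posterior(likelihood_ratios, prior_pos=0.5):
    """Posterior P(pos | features) under the naive independence assumption.

    Each entry is P(feature | pos) / P(feature | neg) for one observed feature.
    """
    odds = prior_pos / (1 - prior_pos)
    for ratio in likelihood_ratios:
        odds *= ratio  # every feature is treated as independent evidence
    return odds / (1 + odds)


# Hypothetical ratio: the token is 3x as likely in the target class.
once = nb_posterior([3.0])        # "Hong Kong" treated as a single feature
twice = nb_posterior([3.0, 3.0])  # "Hong" and "Kong" counted separately

print(once, twice)
```

The second token carries no new information, yet the posterior moves from 0.75 to 0.9, which illustrates why NB probabilities tend to be overconfident when features are correlated.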

**[E5] Derive the gradient for binary logistic regression.**
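
A sketch of the derivation, assuming the notation of Chapter 5 of the JM book: feature vector $x$, weights $w$, bias $b$, gold label $y \in \{0, 1\}$, and prediction $\hat{y} = \sigma(z)$ with $z = w \cdot x + b$.

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{d\sigma}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
$$

$$
L_{CE}(\hat{y}, y) = -\bigl[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\bigr]
$$

$$
\frac{\partial L_{CE}}{\partial w_j}
= -\left[ \frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}} \right] \hat{y}\,(1 - \hat{y})\, x_j
= -\bigl[\, y\,(1 - \hat{y}) - (1 - y)\,\hat{y} \,\bigr]\, x_j
= (\hat{y} - y)\, x_j
$$

$$
\frac{\partial L_{CE}}{\partial b} = \hat{y} - y
$$

The first step applies the chain rule through $\hat{y} = \sigma(z)$ and $\partial z / \partial w_j = x_j$; the gradient is the prediction error multiplied by the input feature.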

**[E6] Derive the gradient for multi-class logistic regression.**
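
A sketch of the derivation, assuming $K$ classes, a one-hot gold vector $y$ (so $\sum_k y_k = 1$), one weight vector $w_k$ and bias $b_k$ per class, scores $z_k = w_k \cdot x + b_k$, and softmax outputs $\hat{y}_k$.

$$
\hat{y}_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}
$$

$$
L_{CE}(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log \hat{y}_k
= -\sum_{k=1}^{K} y_k z_k + \log \sum_{j=1}^{K} \exp(z_j)
$$

$$
\frac{\partial L_{CE}}{\partial z_k} = -y_k + \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} = \hat{y}_k - y_k
$$

$$
\frac{\partial L_{CE}}{\partial w_{k,i}} = (\hat{y}_k - y_k)\, x_i, \qquad
\frac{\partial L_{CE}}{\partial b_k} = \hat{y}_k - y_k
$$

As in the binary case, the gradient for each class is the difference between the predicted and the gold probability of that class, multiplied by the input feature.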