# Worksheet for Week 03: Text Classification & Sentiment Analysis + Logistic Regression

## Background

In this worksheet, we will follow the material covered either in [Chapter 4 of the JM book](https://web.stanford.edu/~jurafsky/slp3/4.pdf) or in the [Text Classification and Sentiment Analysis Playlist](https://www.youtube.com/playlist?list=PLaZQkZp6WhWxU3kA6wV0nb5dY1SXDEKWH). For the questions about logistic regression, you can refer to [Chapter 5 of the JM book](https://web.stanford.edu/~jurafsky/slp3/5.pdf).

Please keep the provided files in the same folder as this worksheet so that the given scripts run and the text examples display without errors.


## Exercises

**[E1] You may have noticed that our classification models require a significant number of training examples in order to converge to the underlying probability distribution of the dataset. This makes annotated data essential: models keep requiring more and more annotated examples.**

**In this exercise, you'll have the chance to annotate your own small dataset (taken from [this Kaggle competition](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/overview) and stored as a JSON file containing only 30 examples). Please answer the questions [E1a] and [E1b] after loading the dataset with the following script:**

In [15]:
import json
import pandas as pd

pd.set_option('display.max_colwidth', 100)

# Read the JSON file (assumed to be UTF-8 encoded); the context manager closes it afterwards
with open('movie_reviews.json', encoding='utf-8') as f:
    reviews = json.load(f)

# Build a dataframe from the entries stored under the 'movie_reviews' key
movie_reviews_df = pd.DataFrame(reviews['movie_reviews'])

# Show the first five reviews
movie_reviews_df.head()

|   | review_id | review | label |
|---|---|---|---|
| 0 | r1 | A series of escapades demonstrating the adage that what is good for the goose is also good for t... | |
| 1 | r2 | This quiet, introspective and entertaining independent is worth seeking | |
| 2 | r3 | Even fans of Ismail Merchant's work , I suspect, would have a hard time sitting through this one. | |
| 3 | r4 | Cattaneo should have followed the runaway success of his first film, The Full Monty, with someth... | |
| 4 | r5 | A positively thrilling combination of ethnography and all the intrigue, betrayal, deceit and mur... | |


**[E1a] Annotate each movie review in the dataframe using only two class labels: 'Positive' or 'Negative'. Write down your annotations as a list of (review ID, numeric label) pairs, so that (1, 0) means "example 1 is annotated with label 0". For convenience, you can use 0 for the negative and 1 for the positive label. Then compare your annotations with those of your peers. Can you think of a way of quantifying the similarities and differences between your annotations?**
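
One possible way to quantify agreement, sketched here under assumptions: compare the raw fraction of matching labels with Cohen's kappa, which corrects for the agreement two annotators would reach by chance. The annotation lists and the function names `observed_agreement` and `cohens_kappa` are hypothetical and only illustrate the idea; other agreement measures would work as well.

```python
from collections import Counter

def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    all_labels = set(labels_a) | set(labels_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in all_labels)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations as (review ID, label) pairs; 0 = negative, 1 = positive.
annotator_a = [(1, 1), (2, 1), (3, 0), (4, 0), (5, 1), (6, 0)]
annotator_b = [(1, 1), (2, 1), (3, 0), (4, 1), (5, 1), (6, 0)]

# Sort by review ID so that both label lists are aligned.
labels_a = [label for _, label in sorted(annotator_a)]
labels_b = [label for _, label in sorted(annotator_b)]

agreement = observed_agreement(labels_a, labels_b)
kappa = cohens_kappa(labels_a, labels_b)
print(agreement, kappa)
```

In this toy case the annotators disagree on one of six reviews, and kappa comes out lower than the raw agreement because half of the matches would be expected by chance alone.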

In [16]:
import string
from collections import Counter
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

**[E1b] Imagine doing the task from [E1a] once more, this time with five class labels: 'Positive', 'Somewhat Positive', 'Neutral', 'Somewhat Negative' or 'Negative'. How do you think the disagreement with your colleagues would change? How would you set guidelines for a more robust annotation process, including arbitration rules for examples that receive many different candidate annotations?**
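
As one illustration of an arbitration rule, the sketch below accepts a label only when a strict majority of annotators chose it and otherwise flags the review for a separate arbiter. The votes, the helper name `adjudicate`, and the strict-majority threshold are all assumptions made for this example, not a prescribed procedure.

```python
from collections import Counter

def adjudicate(votes):
    """Return the majority label, or None when no label has a strict majority."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

# Hypothetical five-class votes from three annotators per review.
votes_by_review = {
    "r1": ["Positive", "Somewhat Positive", "Positive"],
    "r2": ["Neutral", "Somewhat Negative", "Somewhat Positive"],
}

decisions = {review_id: adjudicate(votes) for review_id, votes in votes_by_review.items()}
for review_id, final in decisions.items():
    print(review_id, final if final is not None else "send to arbiter")
```

With five labels, three annotators can easily end up with three different votes, so a rule like this sends more examples to arbitration than in the two-label setting.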

**[E2] In this exercise, you are given 4 different Turkish newspaper texts. These texts belong to 4 different classes: 'technology', 'sport', 'art and entertainment' and 'economy'.**<br>

**Assuming that none of you speak Turkish, how would you assign a unique label to each of these texts given in the dataframe below? Run the script to see the dataframe and think about the appropriate label for each text.**



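In [5]:
import json
import pandas as pd

pd.set_option('display.max_colwidth', 500)

# Read the JSON file (assumed to be UTF-8 encoded, so that Turkish characters display correctly)
with open('turkish_news_examples.json', encoding='utf-8') as f:
    news = json.load(f)

# Build a dataframe from the entries stored under the 'turkish_news' key
turkish_news_df = pd.DataFrame(news['turkish_news'])

# Placeholder labels: replace each empty string with your prediction
turkish_news_df.at[0, 'label'] = ""
turkish_news_df.at[1, 'label'] = ""
turkish_news_df.at[2, 'label'] = ""
turkish_news_df.at[3, 'label'] = ""

# Display without the index column
# (Styler.hide_index() was removed in pandas 2.0; on pandas < 1.4 use it instead)
turkish_news_df.style.hide(axis="index")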

| news_id | review | label |
|---|---|---|
| n1 | Fitch'in kararına yurtdışından tepkiler: Fitch'in Türkiye'nin kredi derecelendirme notunu yükseltmesine yurt dışındaki ekonomistlerden de olumlu tepkiler geldi. UBS gelişmekte olan piyasalar strateji uzmanı Manik Narain, 'Kredi piyasaları Türkiye ile bir süredir yatırım yapılabilir notu almış gibi ticaret yapıyordu. Bugünden itibaren reel bir yatırım akışı başlayacak ve kredi masrafları düşecek' dedi. Standart Bank gelişmekte olan piyasalar araştırma müdürü Timothy Ash ise 'Yatırım yapılabilir notu Türkiye'ye uzun zaman önce verilmeliydi' diye konuşurken ülkeye dayatılan dış finansal risklerin uzun süredir abartıldığını belirtti. Ash, bu durumun Türkiye'ye güveni arttıracağını, Moody ve S&P'nin de Fitch'i takip etmesini beklediğini sözlerine ekledi. | |
| n2 | Herkes Burak'a Murat Burak'a. Büyükşehir maçında en çok pas '43'le Burak Yılmaz'a atıldı. G.Saray'daki en iyi futbolunu oynayan Kral da kaleye sırtı dönük aldığı pasları, Umut'a asiste çevirdi. G.Saray Teknik Direktörü Fatih Terim, Cluj sınavının provasını yaptığı Büyükşehir maçında taktik zekasından örnekler sergiledi. Fatih Hoca, '3 pasta gol' atma dersi verdi. Cim Bom'un 10 gol girişiminde ortalama pas sayısı 3.20 oldu. 3 pasta kaleye gitme taktiğinin odak noktasındaki isimse Burak Yılmaz'dı. | |
| n3 | Bayramda siber suç kurbanı olmamanın 7 yolu: Dünyanın en büyük antivirüs yazılım kuruluşlarından ESET, Türkiye'de Kurban Bayramı döneminde seyahatlerin ve kişisel bilgisayar kullanımının artması nedeniyle kullanıcıların siber suç kurbanı olmaması için uyarılarda bulundu. 'USB gibi taşınabilir hafıza kartlarının böyle dönemlerde paylaşımı artıyor' diyen ESET Türkiye Genel Müdür Yardımcısı Alev Akkoyunlu, 'Kontrolsüz kullanılan bellekler günümüzün en büyük virüs bulaştırıcıları' açıklamasını yaptı. Akkoyunlu, bu dönemde internette bayram armağanı olarak ulaşan 'imkansız' teklifler ve mesajlar konusunda da uyanık olmaya çağırarak bayramda güvende kalmak için 7 öneride bulundu. | |
| n4 | Stok, 17 Ekim'de Sangria Live'da Bağdat Caddesi’nin tek rock barı olan Sangria, sizi genç ve enerjik bir grupla eğlenceye davet ediyor. 17 Ekim Cumartesi akşamı Stok en yeni ve eğlenceli repertuvarı ile sahnede olacak. Bir yerli içki dahil 25 TL olan konser hafta ortası keyfinizi yerine getirecek. 13 yıldır rock severlerin Bağdat Caddesi’ndeki favori mekanı olan Sangria Live’da konserler son hızıyla devam ediyor. | |


**How do you make your predictions? Do you look at any particular words to decide on the actual label? How does the way that you solved this task relate to how Naive Bayes does classification?**

**[E3] If the document is represented by a certain type of feature, there is a direct relationship between a Naive Bayes classifier and class-conditional unigram models. How? Did you use such a relationship while solving [E2]?**<br>

**In other words, consider a text document represented by a certain type of feature (e.g. the words themselves, their parts of speech, their pronunciations etc.), and a text classification model that works only with the counts of each type in the feature vocabulary. Would such a model resemble what we have seen so far with unigram models?**
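
To make the setting concrete, here is a minimal sketch that assumes word features, add-one smoothing, and a tiny hypothetical English training set. It fits one unigram model per class and classifies a text by adding the log prior to the unigram log-likelihood under each class. The names `train_unigram_models`, `classify`, and `toy_docs` are illustrative only.

```python
import math
from collections import Counter

def train_unigram_models(docs_by_class):
    """Fit one add-one-smoothed unigram model per class, plus log class priors."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d.split()}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    models = {}
    for label, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d.split())
        total = sum(counts.values())
        log_prob = {w: math.log((counts[w] + 1) / (total + len(vocab))) for w in vocab}
        models[label] = (math.log(len(docs) / n_docs), log_prob)
    return models

def classify(text, models):
    """Pick the class whose log prior plus unigram log-likelihood is highest."""
    scores = {}
    for label, (log_prior, log_prob) in models.items():
        # Words outside the training vocabulary are skipped.
        scores[label] = log_prior + sum(log_prob[w] for w in text.split() if w in log_prob)
    return max(scores, key=scores.get), scores

# Hypothetical toy training data, two documents per class.
toy_docs = {
    "sport": ["goal match team", "team wins match"],
    "economy": ["bank credit market", "market credit rating"],
}

models = train_unigram_models(toy_docs)
label, scores = classify("credit market match", models)
print(label, scores)
```

Each per-class score is the log prior plus the log-probability that the class's unigram model assigns to the text, which is the quantity a multinomial Naive Bayes classifier compares across classes.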



**[E4] What is the problem with correlated features for NB, and how might they occur in natural language?**<br>
Hint: If you are not sure how correlated features affect the performance of NB classifiers, you can check out this [link](https://machinelearningmastery.com/better-naive-bayes/) to get some ideas about how to design an NB algorithm.
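
A small numeric sketch may help with intuition. It assumes equal class priors and made-up likelihoods, and compares the Naive Bayes posterior when a piece of evidence is counted once with the posterior when the same evidence arrives through two perfectly correlated features (for instance two tokens that almost always occur together, such as "Hong" and "Kong"). The function name `nb_posterior` and all numbers are hypothetical.

```python
def nb_posterior(likelihoods_pos, likelihoods_neg, prior_pos=0.5):
    """Posterior P(pos | features) under the Naive Bayes independence assumption."""
    score_pos = prior_pos
    score_neg = 1.0 - prior_pos
    for p_pos, p_neg in zip(likelihoods_pos, likelihoods_neg):
        score_pos *= p_pos
        score_neg *= p_neg
    return score_pos / (score_pos + score_neg)

# One piece of evidence with P(f | pos) = 0.6 and P(f | neg) = 0.3.
single = nb_posterior([0.6], [0.3])

# The same evidence seen through two perfectly correlated features:
# Naive Bayes multiplies the likelihood twice as if they were independent.
duplicated = nb_posterior([0.6, 0.6], [0.3, 0.3])

print(round(single, 3))      # 0.667
print(round(duplicated, 3))  # 0.8
```

The second feature adds no new information, yet the posterior moves from about 0.667 to 0.8, which shows how double-counted evidence makes the classifier overconfident.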


**[E5] Derive the gradient for binary logistic regression.**
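
Once you have a derivation, a numerical check is a useful sanity test. The sketch below compares the closed form stated in JM Chapter 5, $(\sigma(w \cdot x + b) - y)\,x_j$ for each weight and $\sigma(w \cdot x + b) - y$ for the bias, with central finite differences of the cross-entropy loss. The weights, bias, and example are hypothetical, and the helper names are illustrative; you can swap in your own derived expression for `analytic_gradient`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss for one example with label y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def analytic_gradient(w, b, x, y):
    """Closed form: (sigmoid(w.x + b) - y) * x_j for weights, (sigmoid(w.x + b) - y) for bias."""
    err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
    return [err * xi for xi in x], err

def numerical_gradient(w, b, x, y, eps=1e-6):
    """Central finite differences of the loss for each weight and the bias."""
    grad_w = []
    for j in range(len(w)):
        w_plus = w[:j] + [w[j] + eps] + w[j + 1:]
        w_minus = w[:j] + [w[j] - eps] + w[j + 1:]
        grad_w.append((loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * eps))
    grad_b = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
    return grad_w, grad_b

# Hypothetical weights, bias, and a single training example.
w, b = [0.5, -1.0, 0.25], 0.1
x, y = [1.0, 2.0, -1.0], 1

grad_w, grad_b = analytic_gradient(w, b, x, y)
num_w, num_b = numerical_gradient(w, b, x, y)
max_diff = max(abs(a - n) for a, n in zip(grad_w + [grad_b], num_w + [num_b]))
print(max_diff)  # should be very close to zero if the two gradients agree
```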

**[E6] Derive the gradient for multi-class logistic regression.**
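
The same kind of check works for the multi-class case. This sketch assumes a softmax classifier without bias terms and compares the closed form stated in JM Chapter 5, $(p_k - \mathbb{1}\{y = k\})\,x_i$ for weight $W_{k,i}$, with central finite differences. The weight matrix, the example, and the function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(W, x):
    """Softmax over the per-class scores W[k] . x."""
    return softmax([sum(wi * xi for wi, xi in zip(row, x)) for row in W])

def softmax_loss(W, x, y):
    """Cross-entropy loss for one example whose gold class index is y."""
    return -math.log(class_probs(W, x)[y])

def softmax_analytic_gradient(W, x, y):
    """Closed form: dL/dW[k][i] = (p_k - 1{y = k}) * x_i."""
    p = class_probs(W, x)
    return [[(p[k] - (1.0 if k == y else 0.0)) * xi for xi in x] for k in range(len(W))]

def softmax_numerical_gradient(W, x, y, eps=1e-6):
    """Central finite differences, perturbing one weight at a time."""
    grad = [[0.0] * len(x) for _ in W]
    for k in range(len(W)):
        for i in range(len(x)):
            original = W[k][i]
            W[k][i] = original + eps
            loss_plus = softmax_loss(W, x, y)
            W[k][i] = original - eps
            loss_minus = softmax_loss(W, x, y)
            W[k][i] = original  # restore the weight
            grad[k][i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

# Hypothetical 3-class weight matrix and a single example with gold class 2.
W = [[0.2, -0.5], [0.1, 0.3], [-0.4, 0.6]]
x, y = [1.5, -2.0], 2

ga = softmax_analytic_gradient(W, x, y)
gn = softmax_numerical_gradient(W, x, y)
softmax_max_diff = max(abs(a - n) for ra, rn in zip(ga, gn) for a, n in zip(ra, rn))
print(softmax_max_diff)  # should be very close to zero if the two gradients agree
```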