# Exercise 4

The following code loads a dataset of reviews about hotels and electronic products. Column "Review" contains the review text, and column "Domain" indicates whether a review is about a hotel or an electronic product. This exercise focuses on the hotel reviews, so the code keeps only the rows of the raw data whose "Domain" equals "Hotels" and stores them in a new data frame.

In [1]:
import pandas as pd
df = pd.read_csv("classdata/Lies_and_Truths.csv")
df = df[df.Domain=="Hotels"].copy()
df.reset_index(drop=True, inplace=True)
df[["Domain","Review"]].head()

|   | Domain | Review |
|---|--------|--------|
| 0 | Hotels | I've never taken the time to write a review of... |
| 1 | Hotels | I mistakenly thought that since my multiple st... |
| 2 | Hotels | We got stuck in Orlando Florida and the airlin... |
| 3 | Hotels | GORGEOUS HOTEL! I stayed here for my honeymoon... |
| 4 | Hotels | Every one has a dream to enjoy life with his/h... |


1. Convert column "Review" to a DTM based on the following requirements:

    - Use the default tokenizer from sklearn library. 
    - Remove stop words in the list of nltk. 
    - Don't stem tokens. 
    - Create DTM with TF scores.

  Save your DTM as a variable called **DTM1**. Print the shape of DTM1.

In [2]:
#Your answer here:

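from sklearn.feature_extraction.text import CountVectorizer
import nltk

# The stop word list requires the NLTK "stopwords" corpus;
# run nltk.download("stopwords") once if it is not installed.
nltk_stopwords = nltk.corpus.stopwords.words("english")
# CountVectorizer uses sklearn's default tokenizer and stores raw term counts (TF).
vectorizer = CountVectorizer(stop_words=nltk_stopwords)
DTM1 = vectorizer.fit_transform(df['Review'])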

#Check your answer:
DTM1.shape

(790, 5413)
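
To make the requirements above concrete, here is a minimal pure-Python sketch of what a TF document-term matrix contains. It assumes scikit-learn's documented default token pattern (lowercased tokens of two or more word characters). The toy reviews and the three-word stop list are invented for illustration and come neither from the dataset nor from nltk.

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's documented default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical toy data, not taken from Lies_and_Truths.csv.
toy_reviews = ["The room was clean.", "Clean room, and a friendly staff!"]
toy_stopwords = {"the", "was", "and"}

def tokenize(doc):
    """Lowercase, split into tokens of 2+ characters, drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(doc.lower()) if t not in toy_stopwords]

counts = [Counter(tokenize(doc)) for doc in toy_reviews]
vocabulary = sorted(set().union(*counts))   # one column per unique term
toy_dtm = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
print(toy_dtm)
```

Each row is one review and each column one term, which is why DTM1 has 790 rows and one column per unique non-stop-word token.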

2. Convert column "Review" to a DTM based on the following requirements:

    - Use the default tokenizer from sklearn library. 
    - Remove stop words in the list of nltk. 
    - Stem the tokens using the SnowBall stemmer from nltk. 
    - Create DTM with TF score.
    
   Save your DTM as a variable called **DTM2**. Print the shape of DTM2. Compare the shape with question 1. 

In [3]:
#Your answer here:

stemmer = nltk.stem.SnowballStemmer("english")  # You may use a different stemmer.

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # The inherited analyzer lowercases, tokenizes, and removes stop words;
        # stemming is then applied to every term it returns.
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

vectorizer = StemmedCountVectorizer(stop_words=nltk_stopwords)
DTM2 = vectorizer.fit_transform(df['Review'])

#Check your answer:
DTM2.shape

(790, 3938)
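
Question 2 also asks for a comparison with question 1. A small sketch of that comparison follows, using the two reported shapes as literals; in the notebook you would read them from `DTM1.shape` and `DTM2.shape` instead.

```python
# Shapes copied from the outputs of questions 1 and 2.
shape_dtm1 = (790, 5413)   # no stemming
shape_dtm2 = (790, 3938)   # Snowball stemming

same_documents = shape_dtm1[0] == shape_dtm2[0]
removed_terms = shape_dtm1[1] - shape_dtm2[1]
share_removed = removed_terms / shape_dtm1[1]

print(f"Same number of documents: {same_documents}")
print(f"Vocabulary shrinks by {removed_terms} terms ({share_removed:.1%})")
```

The row count is unchanged because each row is still one review. The column count drops because stemming maps inflected forms, such as plurals, onto a shared stem.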

3. Use the column sums of DTM2 from question 2 to calculate the total frequency of each unique term. Save your output as a two-column data frame called **dffreq**, in which the terms are given in column "Term" and their total frequencies are given in column "Frequency". Sort **dffreq** by total frequency in descending order and reset the row index.

In [4]:
#Your answer here:

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
dffreq = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                       'Frequency': DTM2.sum(axis=0).tolist()[0]
                      })

dffreq.sort_values(by="Frequency",inplace=True,ascending=False)
dffreq.reset_index(inplace=True,drop=True)

#Check your answer:
dffreq.head(10)



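|   | Term  | Frequency |
|---|-------|-----------|
| 0 | hotel | 1356 |
| 1 | room  | 1224 |
| 2 | stay  | 824 |
| 3 | br    | 461 |
| 4 | staff | 377 |
| 5 | would | 297 |
| 6 | place | 286 |
| 7 | one   | 285 |
| 8 | night | 273 |
| 9 | clean | 268 |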
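
The column-sum step can be illustrated without any libraries. The sketch below uses a hypothetical 3-by-3 toy matrix (not the real DTM2) to show that summing down each column gives a term's total frequency, which is then ranked in descending order.

```python
# Hypothetical toy DTM: 3 documents (rows) x 3 terms (columns).
toy_terms = ["clean", "hotel", "room"]
toy_dtm = [
    [1, 2, 0],
    [0, 1, 3],
    [1, 2, 1],
]

# Summing down each column gives a term's total frequency in the corpus.
column_sums = [sum(column) for column in zip(*toy_dtm)]

# Pair terms with their totals and sort by frequency, highest first.
ranked = sorted(zip(toy_terms, column_sums), key=lambda pair: pair[1], reverse=True)
print(ranked)
```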


4. Convert column "Review" to a DTM based on the following requirements:

    - Use the default tokenizer from sklearn library. 
    - Remove stop words in the list of nltk. 
    - Stem the tokens using the SnowBall stemmer from nltk. 
    - Create DTM with TFIDF score without row normalization.
    
   Save your DTM as a variable called **DTM3**. Print the shape of DTM3.

In [5]:
#Your answer here:
from sklearn.feature_extraction.text import TfidfVectorizer
stemmer = nltk.stem.SnowballStemmer("english")

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

# norm=None turns off the default L2 row normalization.
vectorizer = StemmedTfidfVectorizer(stop_words=nltk_stopwords, norm=None)
DTM3 = vectorizer.fit_transform(df['Review'])

#Check your answer:
DTM3.shape

(790, 3938)
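
DTM3 has the same shape as DTM2 because TF-IDF reweights the cell values without changing the vocabulary. As a rough illustration of what the unnormalized scores look like, the sketch below computes them by hand on hypothetical toy counts. It assumes the smoothed IDF formula documented for scikit-learn's defaults (`smooth_idf=True`, `sublinear_tf=False`); it is a hand calculation, not the library's implementation.

```python
import math

# Hypothetical toy term counts: one dict per document.
toy_counts = [
    {"clean": 1, "room": 2},
    {"room": 1, "staff": 1},
    {"room": 1},
]
n_docs = len(toy_counts)

def idf(term):
    """Smoothed IDF as documented for scikit-learn: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for doc in toy_counts if term in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1

# Unnormalized TF-IDF (the norm=None case): raw count times IDF.
toy_tfidf = [{term: tf * idf(term) for term, tf in doc.items()} for doc in toy_counts]

for row in toy_tfidf:
    print({term: round(score, 3) for term, score in row.items()})
```

A term that appears in every document, such as "room" here, gets an IDF of exactly 1, so its score equals its raw count; rarer terms are weighted up.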

5. Convert column "Review" to a DTM based on the following requirements:

    - Use the default tokenizer from sklearn library. 
    - Remove stop words in the list of nltk. 
    - Stem the tokens using the SnowBall stemmer from nltk. 
    - Create DTM with TF scores.
    - Use only bigrams in vocabulary. (*Hint: Set ngram_range*)
    - Use only the top 2000 bigrams.  (*Hint: Set max_features*)
    
   Save your DTM as a variable called **DTM4**. Print the shape of DTM4.

In [8]:
#Your answer here:

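# Note: with ngram_range=(2, 2) the inherited analyzer emits whole bigram
# strings, so the stemmer in StemmedCountVectorizer is applied to each
# bigram as a single string rather than to its two tokens separately.
vectorizer = StemmedCountVectorizer(stop_words=nltk_stopwords,
                                    ngram_range=(2, 2),
                                    max_features=2000)
DTM4 = vectorizer.fit_transform(df['Review'])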

#Check your answer:
DTM4.shape

(790, 2000)
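
The two hints can be pictured with a small pure-Python sketch. It assumes, following the scikit-learn documentation, that `ngram_range=(2, 2)` joins adjacent tokens with a space and that `max_features` keeps the terms with the highest total frequency across the corpus. The toy token lists are hypothetical and stemming is left out to keep the example short.

```python
from collections import Counter

# Hypothetical toy reviews, already tokenized and stop-word-filtered.
toy_tokens = [
    ["great", "hotel", "great", "staff"],
    ["great", "hotel", "clean", "room"],
    ["clean", "room", "great", "hotel"],
]

def bigrams(tokens):
    """Join each pair of adjacent tokens into one space-separated term."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Count bigrams over the whole corpus and keep the k most frequent ones.
corpus_counts = Counter(bg for doc in toy_tokens for bg in bigrams(doc))
k = 2
top_bigrams = [bg for bg, _ in corpus_counts.most_common(k)]
print(top_bigrams)
```

Because the vocabulary is capped, DTM4 has exactly 2000 columns as long as the corpus contains at least that many distinct bigrams.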