# Introduction to Data Mining
## Assignment 2
## Due Date: 15 June

Student Name: Daniel Christov Babbev

Student id: 11712624 
***

This assignment should be submitted in the same way as for assignment 1.


### Part 1: Text
Answer each of the following. **`[`*`10 points each`*`]`**

1\. What are two problems with using text in machine learning? Why are these not issues when using numeric features as we have been using throughout class so far?

Answer here!

2\. What is the name of the simplified approach we saw in class for dealing with text that ignores word order and sentence structure? What is one way of correcting for this simplification and how does it work? Why does it help us get some sense of word order? Why is it not a perfect solution?

Answer here!

3\. Some words will be more important than others when building text models. What are two ways of dealing with different levels of importance for words? *Hint: in class we saw one way to totally ignore certain words and one way to downplay particular words. Both of these dealt with "common" words.*

Answer here!

***

### Part 2: Naïve Bayes
1\. From your reading you know that the naïve Bayes classifier works by calculating the conditional probabilities of each feature, $e_i$, occurring with each class $c$ and treating them independently. This gives the probability of a certain class occurring given a set of features, or a piece of evidence, $E$, as

$$P(c \mid E) = \frac{P(e_1 \mid c) \cdot P(e_2 \mid c) \cdots P(e_k \mid c) \cdot P(c)}{P(E)}.$$

The conditional probability of each piece of evidence occurring with a given class is given by

$$P(e_i \mid c) = \frac{\text{count}(e_i, c)}{\text{count}(c)}.$$

In the above equation $\text{count}(e_i, c)$ is the number of documents in a given class that contain feature $e_i$ and $\text{count}(c)$ is the number of documents that belong to class $c$. 
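
The two formulas above can be traced by hand on a tiny example. The sketch below is only an illustration: the four documents, their labels, and the helper names `conditional_probability` and `class_score` are hypothetical and not part of the assignment data. Dividing by the sum of the class scores stands in for dividing by $P(E)$.

```python
# Hypothetical toy corpus: each document is a set of words plus a class label.
docs = [
    ({"great", "fun"}, 1),
    ({"great", "plot"}, 1),
    ({"boring", "plot"}, 0),
    ({"boring", "slow"}, 0),
]

def conditional_probability(feature, label):
    """P(e_i | c) = count(e_i, c) / count(c), without smoothing."""
    in_class = [words for words, y in docs if y == label]
    with_feature = sum(1 for words in in_class if feature in words)
    return with_feature / len(in_class)

def class_score(evidence, label):
    """Numerator of P(c | E): the prior P(c) times each P(e_i | c)."""
    score = sum(1 for _, y in docs if y == label) / len(docs)
    for feature in evidence:
        score *= conditional_probability(feature, label)
    return score

evidence = ["great", "plot"]
scores = {label: class_score(evidence, label) for label in (0, 1)}

# Normalize so the scores sum to one; this plays the role of dividing by P(E).
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print(posteriors)
```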

A common variation of the above is to use Laplace (sometimes called +1) smoothing. Recall the use of Laplace smoothing introduced toward the end of Chapter 3 in the section Probability Estimation. In sklearn this is done by setting `alpha=1` when constructing `BernoulliNB()` (this is also the default behavior). Laplace smoothing slightly changes the conditional probabilities to

$$P(e_i \mid c) = \frac{\text{count}(e_i, c) + 1}{\text{count}(c) + 2}.$$
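
To see the size of the change, the smoothed and unsmoothed estimates can be compared for a few counts. This is a minimal sketch; the function names and the counts (a class with 50 documents) are made up for illustration.

```python
def unsmoothed_probability(count_feature_and_class, count_class):
    """count(e_i, c) / count(c)."""
    return count_feature_and_class / count_class

def smoothed_probability(count_feature_and_class, count_class, alpha=1):
    """(count(e_i, c) + alpha) / (count(c) + 2 * alpha).

    The factor 2 reflects the two outcomes of a binary feature
    (present or absent); alpha=1 gives the +1 / +2 formula above.
    """
    return (count_feature_and_class + alpha) / (count_class + 2 * alpha)

# Hypothetical counts for a class that contains 50 documents.
for count in (0, 1, 25, 50):
    print(count,
          round(unsmoothed_probability(count, 50), 4),
          round(smoothed_probability(count, 50), 4))
```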

In no more than **one paragraph**, describe why this is useful. Try to think of a case when not using Laplace smoothing would result in "bad" models. Try to give an example. *We discussed this in class. Do you think probabilities or counts of zero might be a problem?* Be precise. **`[`*`15 points`*`]`**

One paragraph answer here.

***

### Part 3: Feature engineering
Answer each of the following. **`[`*`10 points each`*`]`**

1\. Why are non-numeric features a problem during the modeling phase? Are non-numeric features a problem for all types of models?

Answer here!

2\. Categorical variables that can be thought of as scales (e.g., a measure of satisfaction) can be mapped to a numeric scale. What is an issue that we face when performing this mapping? What is an alternative to creating a mapping?

Answer here!

***

### Part 4: Text mining
We are going to use some data from IMDB movie reviews to predict if reviews are positive or negative. The file `data/imdb.csv` has two columns:

- `Text`: The text from the IMDB review
- `Class`: Set to 1 if the review is positive and 0 if it is negative

In [1]:
# Run this cell to import the stuff you will need
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
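# sklearn.cross_validation was removed in scikit-learn 0.20;
# sklearn.model_selection provides the same cross_val_score function
from sklearn import model_selection as cross_validation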



1\. Read in the data located in `data/imdb.csv`. Don't forget to set the `quotechar` to `"` and `escapechar` to `\`. **`[`*`5 points`*`]`**

In [13]:
data = pd.read_csv("data/imdb.csv", quotechar='"', escapechar='\\')
data["Text"][1]

'Visually disjointed and full of itself the director apparently chose to seek faux-depth to expand a 5 minute plot into an 81 minute snore-fest.  The moments that work in this film are VERY limited and the characters dont even feel real. How could you feel invested in a main protagonist who was made so surreal?  Substantively AND stylistically it all feels like a quirky dream sequence. Jarring irregular camera work awkward silences and gaps in action and whats with the little spider image crawling across the screen? Whoever thought of that needs to go back to film school. It added no meaning just cheese and didnt even stylistically work with the rest of the film (assuming the film even had a style which is a close call). What a flop.'
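
To see what `quotechar` and `escapechar` do, the same call can be tried on a small in-memory file. This is only a sketch: the two rows below are invented and merely mimic the layout of `data/imdb.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row file: the first review contains a comma and
# backslash-escaped double quotes inside a quoted field.
raw = 'Text,Class\n"He said \\"wow\\", twice",1\n"Dull, slow",0\n'

sample = pd.read_csv(io.StringIO(raw), quotechar='"', escapechar='\\')

# The quoted comma stays inside the text and the escaped quotes are kept
# as literal characters instead of ending the field.
print(sample["Text"][0])
print(sample.shape)
```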

2\. Put your text data into a variable called `X_text` and your target variable into a variable called `Y`.

In [4]:
X_text = data["Text"]
Y = data["Class"]

3\. What is the base rate of this data set? **`[`*`5 points`*`]`**

In [13]:
# Code here
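
One common reading of the base rate is the accuracy obtained by always predicting the most frequent class. The sketch below shows that calculation on a made-up label column; `labels` is hypothetical and stands in for `Y`, so the printed number says nothing about the IMDB data.

```python
import pandas as pd

# Hypothetical labels: three positive reviews and two negative ones.
labels = pd.Series([1, 1, 1, 0, 0])

# Share of each class, then the share of the most frequent class.
class_shares = labels.value_counts(normalize=True)
base_rate = class_shares.max()
print(round(base_rate, 3))
```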

4\. What is the 5-fold area under the ROC curve for a logistic regression model with default regularization? Your features should consist of all **tf-idf count** text features **excluding** English stop words and keeping the original capitalization (i.e., turn lowercase off). **`[`*`10 points`*`]`**

In [20]:
# Create a vectorizer that will track text as tf-idf features
tfidf_vectorizer = TfidfVectorizer(stop_words="english", lowercase=False)

# Let the vectorizer learn what tokens exist in the text data
tfidf_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = tfidf_vectorizer.transform(X_text)

# Create a model
logistic_regression = LogisticRegression()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(logistic_regression, X, Y, scoring="roc_auc", cv=5)

# Print out the average AUC rounded to three decimal places
print("Area under the ROC curve for our classifier is " + str(round(np.mean(aucs), 3)))

Area under the ROC curve for our classifier is 0.938
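
The effect of the vectorizer settings used above is easier to see on a few short strings. This is a sketch with made-up reviews, not the IMDB data. Because scikit-learn's English stop word list is lowercase, turning lowercasing off means a capitalized "The" is kept as a feature while "was" is dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, used only to inspect the learned vocabulary.
toy_reviews = ["The plot was Great", "The acting was dull", "Great acting"]

toy_vectorizer = TfidfVectorizer(stop_words="english", lowercase=False)
X_toy = toy_vectorizer.fit_transform(toy_reviews)

# Tokens that survived stop word removal, with capitalization preserved.
vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
print(X_toy.shape)
```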


5\. What is the 5-fold area under the ROC curve for a naïve Bayes model with default smoothing? Your features should consist of all **binary count** text features **including** English stop words, converting everything to lowercase, and contain unigrams and bigrams. **`[`*`10 points`*`]`**

In [22]:
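# Create a vectorizer that will track text as binary features
# (the task also asks for unigrams and bigrams, i.e. ngram_range=(1, 2);
# the AUC reported below was computed without that setting)
count_vectorizer = CountVectorizer(binary=True, lowercase=True)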

# Let the vectorizer learn what tokens exist in the text data
count_vectorizer.fit(X_text)

# Turn these tokens into a numeric matrix
X = count_vectorizer.transform(X_text)

# Create a model
naive_bayes = BernoulliNB()

# Use this model and our data to get 5-fold cross validation AUCs
aucs = cross_validation.cross_val_score(naive_bayes, X, Y, scoring="roc_auc", cv=5)

# Print out the average AUC
print("Area under the ROC curve for our classifier is " + str(np.mean(aucs)))

Area under the ROC curve for our classifier is 0.9201868556285202
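
For reference, binary unigram and bigram features as described in the question can be produced with `binary=True` and `ngram_range=(1, 2)`. The sketch below uses two invented reviews and is not a rerun of the model above, so it says nothing about how the AUC would change.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus; "not" appears twice in the first review.
toy_reviews = ["not good not fun", "good fun"]

toy_vectorizer = CountVectorizer(binary=True, lowercase=True, ngram_range=(1, 2))
X_toy = toy_vectorizer.fit_transform(toy_reviews)

# Single words and adjacent word pairs both become features.
features = sorted(toy_vectorizer.vocabulary_)
print(features)

# With binary=True every entry is 0 or 1, even for repeated tokens.
print(X_toy.toarray())
```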


6\. Of these two models, which would you prefer? Why? Are there other things we should have tried? What else might you want to do before choosing your final model? **`[`*`5 points`*`]`**

Answer here!