*Goal of classification: which class does a particular datum belong to?*

**Document Classification**

each tweet is a *document* and each word/token is a *feature*

so given a labeled document, the algorithm tries to understand which features correspond to the label (target)

**Bayesian Approach**

define $C_i$ to be the ith class/category

define $p(x)$ to be the probability of observing event x

for document classification, both positive and negative sentiment labels would define the classes

likewise, words or tokens would be events, and p(x) would be the associated probability of observing a particular word

**Bayes Rule**

$$p(c|x) = \frac{p(x|c)p(c)}{p(x)}$$

$p(c|x)$ = probability of event x being in class c

$p(x|c)$ = probability of *generating* event x given class c

$p(c)$ = probability of occurrence of class c

$p(x)$ = probability of event x occurring (usually a pain in the ass to calculate)

$$posterior = \frac{likelihood * prior}{evidence}$$

(evidence is independent of C, hence irrelevant)

$$p(c|x) = \frac{p(x|c)p(c)}{p(x)} \propto p(x|c)p(c)$$

$$p(c_i|x,y) > p(c_j|x,y)$$

choose the class that maximizes the log posterior
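a minimal sketch of this decision rule for a single observed word; the priors and likelihoods are made-up numbers, used only to show that dropping the evidence $p(x)$ does not change which class wins

```python
import math

# Hypothetical numbers, for illustration only.
prior = {"pos": 0.6, "neg": 0.4}          # p(c)
likelihood = {"pos": 0.05, "neg": 0.01}   # p(x|c) for one observed word x

# Unnormalized log posterior: log p(x|c) + log p(c).
log_post = {c: math.log(likelihood[c]) + math.log(prior[c]) for c in prior}

# p(x) is the same for every class, so the argmax does not need it.
best = max(log_post, key=log_post.get)

# Normalizing is only needed when the actual probabilities are wanted.
evidence = sum(likelihood[c] * prior[c] for c in prior)               # p(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}   # p(c|x)
```

here `best` comes out as `"pos"`, and `posterior` sums to 1 once the evidence is divided back in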

**Naive Bayes**

typically used when there are lots of features...

$$p(c|F_1, F_2, \ldots, F_n) = \frac{p(F_1,F_2,\ldots,F_n|c)p(c)}{p(F_1,F_2,\ldots,F_n)}$$

*Definition of conditional probability*

$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$

$$\Rightarrow p(F_1,F_2,\ldots,F_n|c)\,p(c) = p(F_1,F_2,\ldots,F_n,c)$$

*Chain Rule for conditional probability*

$$P(A_4,A_3,A_2,A_1) = P(A_4|A_3,A_2,A_1) * P(A_3|A_2,A_1) * P(A_2|A_1) * P(A_1)$$

$$ p(F_1,F_2,\ldots,F_n,c) = p(c) p(F_1|c) p(F_2|c,F_1) \ldots p(F_n|c,F_1,F_2,F_3,\ldots,F_{n-1}) $$

conditional independence - given a class, assume the features are independent

$$p(F_2|c,F_1)=p(F_2|c)$$

$$p(F_n|c,F_1,F_2,F_3,\ldots,F_{n-1}) = p(F_n|c)$$

$$p(c|F_1,F_2,F_3,\ldots,F_n) \propto p(c) \prod_{i=1}^{n} p(F_i|c) $$
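a small sketch of the factorized score $p(c) \prod p(F_i|c)$; the classes, words, and probabilities in `prior` and `word_prob` are hypothetical, picked only to make the arithmetic easy to follow

```python
# Hypothetical per-class word probabilities p(F_i|c), for illustration only.
prior = {"pos": 0.5, "neg": 0.5}
word_prob = {
    "pos": {"great": 0.20, "movie": 0.10, "boring": 0.01},
    "neg": {"great": 0.02, "movie": 0.10, "boring": 0.15},
}

def unnormalized_posterior(features, c):
    """p(c) * prod_i p(F_i|c), relying on conditional independence."""
    score = prior[c]
    for f in features:
        score *= word_prob[c][f]
    return score

doc = ["great", "movie"]
scores = {c: unnormalized_posterior(doc, c) for c in prior}
prediction = max(scores, key=scores.get)
```

the scores are 0.5 * 0.20 * 0.10 = 0.01 for `pos` and 0.5 * 0.02 * 0.10 = 0.001 for `neg`, so `prediction` is `"pos"`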

*Implementations*

Bernoulli document model - a document is represented by a feature vector with *binary* elements {0,1} indicating the presence or absence of a feature

Multinomial document model - a document is represented by a feature vector with integer elements whose value is the frequency of that word in the document
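to make the two document models concrete, a sketch that builds both vectors for one sentence; the fixed five-word `vocab` is an assumption for illustration (in practice the vocabulary comes from the training corpus)

```python
from collections import Counter

# Hypothetical fixed vocabulary; "memory" does not occur in the document below.
vocab = ["a", "document", "feature", "memory", "vector"]
doc = "a document is represented by a feature vector".split()

counts = Counter(doc)

# Multinomial model: how many times each vocabulary word occurs.
multinomial = [counts[w] for w in vocab]

# Bernoulli model: whether each vocabulary word occurs at all.
bernoulli = [1 if counts[w] > 0 else 0 for w in vocab]
```

the only difference here is the first element: "a" occurs twice, so the multinomial vector records 2 where the Bernoulli vector records 1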

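smoothing - a word never seen with a class in the training sample gets $p(F_i|c) = 0$, which zeroes out the whole product; add a pseudo-count to every word count to avoid this - Laplace smoothing (ad-hoc nonsense)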

logs - multiplying many small probabilities underflows, so sum the log probabilities instead
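a rough sketch of a multinomial naive Bayes scorer that uses both tricks, written from scratch rather than with any library; the training sentences, the helper names, and the choice to skip out-of-vocabulary words are all assumptions made for illustration

```python
import math
from collections import Counter

# Toy labeled documents, made up for illustration.
train = [
    ("pos", "great movie great fun"),
    ("pos", "fun movie"),
    ("neg", "boring movie"),
    ("neg", "boring boring plot"),
]

vocab = sorted({w for _, text in train for w in text.split()})
class_docs = Counter(label for label, _ in train)
word_counts = {c: Counter() for c in class_docs}
for label, text in train:
    word_counts[label].update(text.split())

def log_prior(c):
    return math.log(class_docs[c] / len(train))

def log_likelihood(word, c, alpha=1.0):
    """Laplace (add-alpha) smoothed log p(word|c)."""
    total = sum(word_counts[c].values())
    return math.log((word_counts[c][word] + alpha) / (total + alpha * len(vocab)))

def predict(text):
    # Assumption: words outside the training vocabulary are ignored.
    words = [w for w in text.split() if w in vocab]
    scores = {c: log_prior(c) + sum(log_likelihood(w, c) for w in words)
              for c in class_docs}
    return max(scores, key=scores.get), scores

# "great" never occurs with neg and "plot" never occurs with pos, so without
# smoothing both classes would get probability zero.
label, scores = predict("great plot")

# Why logs: a product of 400 probabilities of 0.01 underflows to 0.0,
# while the equivalent sum of logs stays representable.
tiny = 0.01 ** 400
log_tiny = 400 * math.log(0.01)
```

with add-one smoothing the scores work out to log(0.5 * 3/11 * 1/11) for `pos` and log(0.5 * 1/10 * 2/10) for `neg`, so `label` is `"pos"` by a narrow margin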

**Applications**

documents -> bag of words + ngrams

http://scikit-learn.org/stable/modules/feature_extraction.html

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# BBB = beginning of sentence marker
# EEE = end of sentence marker
tweets = ["BBB a document is represented by a feature vector EEE"]

In [3]:
unigram_vectorizer = CountVectorizer(min_df=1)
X1 = unigram_vectorizer.fit_transform(tweets)
print unigram_vectorizer.get_feature_names()
print
X1.toarray()

[u'bbb', u'by', u'document', u'eee', u'feature', u'is', u'represented', u'vector']



array([[1, 1, 1, 1, 1, 1, 1, 1]])

In [4]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
X2 = bigram_vectorizer.fit_transform(tweets)
print bigram_vectorizer.get_feature_names()
print
X2.toarray()

[u'a', u'a document', u'a feature', u'bbb', u'bbb a', u'by', u'by a', u'document', u'document is', u'eee', u'feature', u'feature vector', u'is', u'is represented', u'represented', u'represented by', u'vector', u'vector eee']



array([[2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [5]:
trigram_vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r'\b\w+\b', min_df=1)
X3 = trigram_vectorizer.fit_transform(tweets)
print trigram_vectorizer.get_feature_names()
print
X3.toarray()

[u'a', u'a document', u'a document is', u'a feature', u'a feature vector', u'bbb', u'bbb a', u'bbb a document', u'by', u'by a', u'by a feature', u'document', u'document is', u'document is represented', u'eee', u'feature', u'feature vector', u'feature vector eee', u'is', u'is represented', u'is represented by', u'represented', u'represented by', u'represented by a', u'vector', u'vector eee']



array([[2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1]])

In [10]:
tweets = [
    "BBB a document is represented by a feature vector EEE",
    "BBB let's choose a sentence with a completely different set of words (features) so we can waste lots of memory EEE"
]

In [11]:
trigram_vectorizer = CountVectorizer(ngram_range=(1, 3), token_pattern=r'\b\w+\b', min_df=1)
X3 = trigram_vectorizer.fit_transform(tweets)
print trigram_vectorizer.get_feature_names()
print
X3.toarray()

[u'a', u'a completely', u'a completely different', u'a document', u'a document is', u'a feature', u'a feature vector', u'a sentence', u'a sentence with', u'bbb', u'bbb a', u'bbb a document', u'bbb let', u'bbb let s', u'by', u'by a', u'by a feature', u'can', u'can waste', u'can waste lots', u'choose', u'choose a', u'choose a sentence', u'completely', u'completely different', u'completely different set', u'different', u'different set', u'different set of', u'document', u'document is', u'document is represented', u'eee', u'feature', u'feature vector', u'feature vector eee', u'features', u'features so', u'features so we', u'is', u'is represented', u'is represented by', u'let', u'let s', u'let s choose', u'lots', u'lots of', u'lots of memory', u'memory', u'memory eee', u'of', u'of memory', u'of memory eee', u'of words', u'of words features', u'represented', u'represented by', u'represented by a', u's', u's choose', u's choose a', u'sentence', u'sentence with', u'sentence with a', u'set', u'

array([[2, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [2, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1,
        1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

**Logistic Regression + L1 regularization**

**Approach**

http://www.quora.com/What-is-the-difference-between-logistic-regression-and-Naive-Bayes