## Hi, sklearn!

## Reading spam collection

In [1]:
!head -n 40 ./data/1-sms-spam-train.txt

ham	Did he say how fantastic I am by any chance, or anything need a bigger life lift as losing the will 2 live, do you think I would be the first person 2 die from N V Q? 
ham	Black shirt n blue jeans... I thk i c ü...
ham	If e timing can, then i go w u lor...
ham	They r giving a second chance to rahul dengra.
ham	I cant pick the phone right now. Pls send a message
ham	Haha good to hear, I'm officially paid and on the market for an 8th
ham	Ffffffffff. Alright no way I can meet up with you sooner?
ham	But i'm really really broke oh. No amount is too small even  &lt;#&gt; 
ham	Only 2% students solved this CAT question in 'xam... 5+3+2= &lt;#&gt;  9+2+4= &lt;#&gt;  8+6+3= &lt;#&gt;  then 7+2+5=????? Tell me the answer if u r brilliant...1thing.i got d answr.
spam	<Forwarded from 21870000>Hi - this is your Mailbox Messaging SMS alert. You have 4 messages. You have 21 matches. Please call back on 09056242159 to retrieve your messages and matches
ham	No da:)he is stupid da..always sending l

In [2]:
import codecs

with codecs.open('./data/1-sms-spam-train.txt') as f:
    labels, messages = zip(*[line.split('\t') for line in f.readlines()])
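
The read above does not name an encoding and keeps the trailing newline on every message. A minimal sketch of a more explicit parse, assuming the file is UTF-8 (the source does not state the encoding); `parse_labeled_lines` and the sample lines are hypothetical names and data used only for illustration:

```python
import io

def parse_labeled_lines(lines):
    """Split 'label<TAB>message' lines into two parallel tuples."""
    labels, messages = [], []
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        # split only on the first tab so tabs inside a message survive
        label, message = line.split('\t', 1)
        labels.append(label)
        messages.append(message)
    return tuple(labels), tuple(messages)

# in-memory stand-in for the training file (hypothetical sample lines)
sample = io.StringIO(u"ham\tIf e timing can, then i go w u lor...\n"
                     u"spam\tYou have 4 messages.\n")
demo_labels, demo_messages = parse_labeled_lines(sample)
print(demo_labels)
```

With a real file, `io.open(path, encoding='utf-8')` could be passed in place of `sample`. Note that stripping the newline would shift the length and whitespace counts by one relative to the feature values printed later in this notebook.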

#### read test dataset

In [3]:
with codecs.open('./data/1-sms-spam-test.txt') as f:
    kaggle_test_messages = f.readlines()

#### prepare solution

In [4]:
import numpy

In [5]:
import pandas
from IPython.display import FileLink

def create_solution(predictions, filename='1-sms-spam-predictions.csv'):
    result = pandas.DataFrame({'Id': numpy.arange(len(predictions)), 'Label': predictions})
    result.to_csv('data/{}'.format(filename), index=False)
    return FileLink('data/{}'.format(filename))

In [6]:
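# sklearn.cross_validation was removed in scikit-learn 0.20;
# sklearn.model_selection provides the same function
from sklearn.model_selection import train_test_split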
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
def compute_data_expressions(messages):
    features = []
    # length of each string
    features.append(map(len, messages))
    
    # number of letters, digits, and whitespace characters (a rough proxy for the word count)
    for pattern in [str.isalpha, str.isdigit, str.isspace]:
        features.append(map(lambda message: sum(map(pattern, message)), messages))
        
    features = numpy.array(features).T
    return features

features = compute_data_expressions(messages)
kaggle_test_features = compute_data_expressions(kaggle_test_messages)

answers = numpy.array(labels) == 'spam' 
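
The cell above relies on Python 2, where `map` returns a list. In Python 3 `map` is lazy, so the same code would not produce a numeric array. A minimal sketch of the same four features written with explicit loops, assuming Python 3 and plain `str` messages; `count_matching` and `basic_text_features` are hypothetical helper names:

```python
def count_matching(message, predicate):
    """Number of characters in message for which predicate is true."""
    return sum(1 for ch in message if predicate(ch))

def basic_text_features(messages):
    """One row per message: [length, letters, digits, whitespace characters]."""
    rows = []
    for message in messages:
        rows.append([
            len(message),
            count_matching(message, str.isalpha),
            count_matching(message, str.isdigit),
            count_matching(message, str.isspace),
        ])
    return rows

demo_rows = basic_text_features(["Call 0905 now", "ok"])
print(demo_rows)
```

Wrapping the result in `numpy.array(...)` would give an array of shape `(n_messages, 4)`, the same layout as `features`.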

In [8]:
features

array([[169, 124,   2,  40],
       [ 45,  26,   0,  11],
       [ 39,  24,   0,  11],
       ..., 
       [ 32,  22,   0,   7],
       [176, 119,  21,  29],
       [ 26,  20,   0,   6]])

In [9]:
from sklearn.neighbors import KNeighborsClassifier
# area under the ROC curve
from sklearn.metrics import roc_auc_score
knn_clf = KNeighborsClassifier(n_neighbors=1)
knn_clf.fit(features, answers)
roc_auc_score(answers, knn_clf.predict_proba(features)[:, 1])

0.997237808402064
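
This score is computed on the same messages the classifier was fitted on. With `n_neighbors=1` each training message is (duplicates aside) its own nearest neighbour, so the number says little about unseen data. As for the metric itself, ROC AUC can be read as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message, with ties counted as one half. A small sketch of that definition; `pairwise_auc` is a hypothetical helper, not a scikit-learn function, and it is quadratic in the number of examples:

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# hypothetical labels and scores: 3 of the 4 pairs are ordered correctly
demo_auc = pairwise_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
print(demo_auc)
```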

In [10]:
create_solution(knn_clf.predict_proba(kaggle_test_features)[:, 1])

In [11]:
trainX, testX, trainY, testY = train_test_split(features, answers, random_state=42)

## kNN

In [12]:
knn_clf = KNeighborsClassifier(n_neighbors=1)
knn_clf.fit(trainX, trainY)
print 'test', roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])
# print 'train', roc_auc_score(trainY, knn_clf.predict_proba(trainX)[:, 1])

test 0.935098650052


## Finding the optimal number of neighbours

In [13]:
for n_neighbors in [1, 2, 4, 8, 16, 32, 64]:
    knn_clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_clf.fit(trainX, trainY)
    print n_neighbors, roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])

1 0.935098650052
2 0.953595534787
4 0.967989211953
8 0.968775239414
16 0.976476866274
32 0.981092073382
64 0.974277431637


### what happens if the metric is changed?

In [14]:
knn_clf = KNeighborsClassifier(metric='canberra', n_neighbors=20)
knn_clf.fit(trainX, trainY)
print roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])
print roc_auc_score(trainY, knn_clf.predict_proba(trainX)[:, 1])

0.983305930541
0.989802797453
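
The Canberra distance divides each coordinate difference by the magnitude of the two coordinates, so a large-valued feature such as message length no longer dominates the way it does under the default Euclidean metric. A minimal sketch of the formula, assuming that a term with a zero denominator contributes 0; `canberra_distance` is a hypothetical helper written for illustration:

```python
def canberra_distance(x, y):
    """Sum over coordinates of |x_i - y_i| / (|x_i| + |y_i|)."""
    total = 0.0
    for a, b in zip(x, y):
        denominator = abs(a) + abs(b)
        if denominator:  # a 0/0 term is treated as 0 (assumed convention)
            total += abs(a - b) / denominator
    return total

# the first two feature rows printed earlier in this notebook
demo_distance = canberra_distance([169, 124, 2, 40], [45, 26, 0, 11])
print(demo_distance)
```

Every coordinate contributes at most 1, so with four features the distance never exceeds 4.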


## Bag of words

In [15]:
vectorizer = CountVectorizer()
vectorizer.fit(messages)
counts = vectorizer.transform(messages).toarray()
test_counts = vectorizer.transform(kaggle_test_messages).toarray()

In [16]:
vectorizer.fit_transform(messages)
vocab = vectorizer.get_feature_names()
#print vocab
print counts.shape

(3000L, 6294L)
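
Each row of `counts` holds, for one message, the number of times every vocabulary word occurs in it. A rough re-implementation with the standard library, assuming the default `CountVectorizer` behaviour of lowercasing and keeping tokens of two or more word characters; `tokenize` and `bag_of_words` are hypothetical helpers and only approximate the real class:

```python
import re
from collections import Counter

# tokens of two or more word characters, as in CountVectorizer's default pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(message):
    """Lowercase the message and extract its tokens."""
    return TOKEN_PATTERN.findall(message.lower())

def bag_of_words(messages):
    """Return (sorted vocabulary, one count row per message)."""
    vocabulary = sorted({token for m in messages for token in tokenize(m)})
    column = {word: index for index, word in enumerate(vocabulary)}
    rows = []
    for message in messages:
        row = [0] * len(vocabulary)
        for word, count in Counter(tokenize(message)).items():
            row[column[word]] = count
        rows.append(row)
    return vocabulary, rows

demo_vocab, demo_counts = bag_of_words(["Call now, call free", "I am free now"])
print(demo_vocab)
print(demo_counts)
```

The single-letter token "I" is dropped, which is why very short words never show up in the vocabulary.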


In [17]:
# vocabulary_ is a dictionary that maps each word to its column index
# vectorizer.vocabulary_

In [18]:
trainX, testX, trainY, testY = train_test_split(counts, answers, random_state=42)

## Naive Bayes

#### gaussian

In [19]:
from sklearn.naive_bayes import GaussianNB
nb_clf = GaussianNB()
nb_clf.fit(trainX, trainY)
roc_auc_score(testY, nb_clf.predict_proba(testX)[:, 1])

0.88849948078920049

#### multinomial

In [41]:
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()
nb_clf.fit(trainX, trainY)
roc_auc_score(testY, nb_clf.predict_proba(testX)[:, 1])

0.98129398869274254

In [21]:
trainX.shape

(2250L, 6294L)

## Linear regression + Ridge regularization

In [22]:
from sklearn.linear_model import Ridge

In [23]:
ridge_clf = Ridge()
ridge_clf.fit(trainX, trainY)
print roc_auc_score(testY, ridge_clf.predict(testX))
print roc_auc_score(trainY, ridge_clf.predict(trainX))

0.989976347064
1.0


**Exercise #0.** Play with the regularization parameter of `Ridge` and see how it affects quality on train and test.
Check the quality of the best model by submitting to Kaggle.


In [100]:
ridgetest = []
for alpha in numpy.arange(0,500,10):
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest.append([alpha,aa,bb,aa+bb])
    #print alpha, roc_auc_score(testY, ridge_clf.predict(testX)), roc_auc_score(trainY, ridge_clf.predict(trainX)) 
ridgetest
## alpha = 20

[[0, 1.0, 0.98146705895927067, 1.9814670589592707],
 [10, 1.0, 0.99443290642667592, 1.9944329064266759],
 [20, 0.99995300828319378, 0.99537037037037046, 1.9953233786535642],
 [30, 0.99979938151671188, 0.99545690550363453, 1.9952562870203465],
 [40, 0.99958611141736053, 0.99522614514826346, 1.9948122565656239],
 [50, 0.99928608737929003, 0.99502422983731398, 1.9943103172166041],
 [60, 0.9989463011193066, 0.99472135687088958, 1.9936676579901962],
 [70, 0.99851795354688067, 0.99438963886004395, 1.9929075924069246],
 [80, 0.99801731102552194, 0.99402907580477673, 1.9920463868302987],
 [90, 0.99754197173628967, 0.99361082266066691, 1.9911527943969567],
 [100, 0.99706121032588768, 0.9932935271720319, 1.9903547374979196],
 [110, 0.99652261295633937, 0.99301949925002886, 1.9895421122063683],
 [120, 0.99604004417221381, 0.99278873889465791, 1.9888287830668716],
 [130, 0.99554843851947172, 0.99241375331718018, 1.9879621918366519],
 [140, 0.99503514438204976, 0.99213972539517714, 1.98717486977722

In [101]:
ridgetest2 = []
for alpha in numpy.arange(20,30,1):
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest2.append([alpha,aa,bb,aa+bb])
    #print alpha, roc_auc_score(testY, ridge_clf.predict(testX)), roc_auc_score(trainY, ridge_clf.predict(trainX)) 
ridgetest2

[[20, 0.99995300828319378, 0.99537037037037046, 1.9953233786535642],
 [21, 0.99994035666713055, 0.9953992154147917, 1.9953395720819223],
 [22, 0.99992409030362073, 0.99541363793700255, 1.9953377282406233],
 [23, 0.99990240181894086, 0.99545690550363453, 1.9953593073225755],
 [24, 0.9998879428291545, 0.99547132802584515, 1.9953592708549996],
 [25, 0.99987348383936792, 0.99547132802584515, 1.9953448118652131],
 [26, 0.99985721747585798, 0.99548575054805588, 1.9953429680239139],
 [27, 0.99984275848607151, 0.99547132802584515, 1.9953140865119168],
 [28, 0.99983733636490157, 0.9955001730702665, 1.995337509435168],
 [29, 0.99981745525394505, 0.99551459559247724, 1.9953320508464223]]

In [26]:
ridge_clf = Ridge(alpha=24)
ridge_clf.fit(trainX, trainY)
create_solution(ridge_clf.predict(test_counts))
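
The sweeps above choose `alpha` by looking at the AUC on the same held-out split that is then used to report quality, which tends to make the chosen value look better than it is. One alternative is to score each `alpha` with k-fold cross-validation on the training part only. A sketch under stated assumptions: the data here is synthetic, and `cv_auc_for_alpha`, the fold count, and the alpha grid are hypothetical choices, not the author's procedure:

```python
import numpy
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cv_auc_for_alpha(X, y, alpha, n_splits=5, seed=42):
    """Mean held-out ROC AUC of Ridge(alpha) over stratified folds."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in folds.split(X, y):
        model = Ridge(alpha=alpha)
        model.fit(X[train_idx], y[train_idx])
        scores.append(roc_auc_score(y[valid_idx], model.predict(X[valid_idx])))
    return float(numpy.mean(scores))

# synthetic stand-in for word counts and spam labels
rng = numpy.random.RandomState(0)
X_demo = rng.poisson(1.0, size=(200, 30)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)

cv_results = {alpha: cv_auc_for_alpha(X_demo, y_demo, alpha)
              for alpha in [0.1, 1, 10, 100]}
best_alpha = max(cv_results, key=cv_results.get)
print(best_alpha)
```

In the notebook, `trainX` and `trainY` would take the place of `X_demo` and `y_demo`, leaving `testX` untouched until the final check.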

**Exercise #1.** Let's write out the correspondence between columns and words (done below). Which words are the most frequent?

In [24]:
dictionary = numpy.empty(len(vectorizer.vocabulary_), dtype='O')
for word, index in vectorizer.vocabulary_.items():  # items() works in both Python 2 and 3
    dictionary[index] = word


In [25]:
# count how many times each word occurs in the training dataset
word_counts = counts.sum(axis=0)
print word_counts
# print only the words that occurred more than 100 times
print dictionary[word_counts > 100]

[ 4 15  1 ...,  1  1  1]
[u'all' u'am' u'and' u'are' u'at' u'be' u'but' u'by' u'call' u'can' u'come'
 u'day' u'do' u'for' u'free' u'from' u'get' u'go' u'good' u'got' u'gt'
 u'have' u'he' u'how' u'if' u'in' u'is' u'it' u'its' u'just' u'know'
 u'like' u'll' u'love' u'lt' u'me' u'my' u'no' u'not' u'now' u'of' u'ok'
 u'on' u'only' u'or' u'out' u'send' u'so' u'text' u'that' u'the' u'then'
 u'there' u'this' u'time' u'to' u'up' u'ur' u'want' u'was' u'we' u'what'
 u'when' u'will' u'with' u'you' u'your']


In [26]:
words_ordered_by_occurences = dictionary[numpy.argsort(word_counts)]
print words_ordered_by_occurences[-50:]

[u'free' u'day' u'good' u'out' u'll' u'go' u'ok' u'from' u'what' u'up'
 u'all' u'when' u'how' u'this' u'gt' u'lt' u'no' u'with' u'or' u'ur' u'get'
 u'just' u'will' u'be' u'we' u'if' u'at' u'but' u'not' u'do' u'so' u'can'
 u'are' u'now' u'on' u'call' u'that' u'of' u'have' u'for' u'your' u'it'
 u'my' u'me' u'in' u'is' u'and' u'the' u'you' u'to']


In [27]:
dist = numpy.sum(counts, axis=0)
#vocab = vectorizer.get_feature_names()
word_frq = []
for tag, count in zip(dictionary, dist):
    word_frq.append([count, tag])
    # print count, tag
word_frq.sort(reverse=True)
word_frq

[[1176, u'to'],
 [1156, u'you'],
 [715, u'the'],
 [510, u'and'],
 [483, u'is'],
 [470, u'in'],
 [427, u'me'],
 [415, u'my'],
 [391, u'it'],
 [373, u'your'],
 [368, u'for'],
 [322, u'have'],
 [321, u'of'],
 [306, u'that'],
 [296, u'call'],
 [286, u'on'],
 [278, u'now'],
 [263, u'are'],
 [260, u'can'],
 [245, u'so'],
 [229, u'not'],
 [229, u'do'],
 [227, u'but'],
 [223, u'at'],
 [220, u'if'],
 [218, u'we'],
 [215, u'be'],
 [207, u'will'],
 [206, u'just'],
 [206, u'get'],
 [205, u'ur'],
 [203, u'with'],
 [203, u'or'],
 [202, u'no'],
 [186, u'lt'],
 [185, u'gt'],
 [179, u'this'],
 [168, u'how'],
 [167, u'when'],
 [164, u'all'],
 [160, u'up'],
 [159, u'what'],
 [158, u'ok'],
 [158, u'from'],
 [157, u'go'],
 [155, u'll'],
 [149, u'out'],
 [147, u'good'],
 [143, u'day'],
 [140, u'free'],
 [136, u'come'],
 [134, u'like'],
 [134, u'know'],
 [130, u'there'],
 [126, u'its'],
 [125, u'time'],
 [125, u'then'],
 [118, u'got'],
 [116, u'was'],
 [115, u'am'],
 [109, u'only'],
 [105, u'send'],
 [104, u

**Exercise #2.** By analyzing the coefficients in `ridge_clf.coef_`, determine which words have the highest impact on the decision (i.e. the largest absolute value of `coef_`).

In [28]:
par_impact=zip(dictionary, ridge_clf.coef_)
par_impact_new=sorted(par_impact, key=lambda x: abs(x[1]), reverse=True)

In [29]:
par_impact_new

[(u'88066', 0.33813707390738645),
 (u'85233', 0.26061762654494608),
 (u'08719181503', 0.25921968176431959),
 (u'voicemail', 0.25921968176431959),
 (u'ringtone', 0.23698860702164426),
 (u'service', 0.22746191611798694),
 (u'ltd', 0.22429401256459078),
 (u'barbie', 0.20343010646563695),
 (u'ken', 0.20343010646563695),
 (u'uk', 0.19701038442312094),
 (u'announcement', 0.19687388348255602),
 (u'dating', 0.19223628133933193),
 (u'arsenal', 0.19159700218174794),
 (u'repeat', 0.18998136078779398),
 (u'private', 0.18963359075721684),
 (u'08712402972', 0.18823223942767139),
 (u'minmobsmorelkpobox177hp51fl', 0.18705047981478173),
 (u'claim', 0.18214968976392065),
 (u'50', 0.18052677336596029),
 (u'statement', 0.1804202792460303),
 (u'connected', 0.17416422506079668),
 (u'gmw', 0.17416422506079668),
 (u'2003', 0.17360700675800919),
 (u'150p', 0.17223465970033397),
 (u'truly', 0.17132384892746855),
 (u'12', 0.16848584729728425),
 (u'accordingly', 0.16674157052020661),
 (u'mobile', 0.16630380235474

**Exercise #3.** Does combining features and counts improve quality? Use `numpy.hstack` to concatenate the arrays.
Explain the result.

In [30]:
counts.shape, features.shape, numpy.hstack((counts,features)).shape

combined = numpy.hstack((counts,features))
trainX, testX, trainY, testY = train_test_split(combined, answers, random_state=42)

ridgetest = []
for alpha in [0.01, 0.1, 1, 10, 100, 1000]:
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest.append([alpha,aa,bb,aa+bb])
ridgetest

[[0.01, 1.0, 0.9905388254297911, 1.9905388254297911],
 [0.1, 1.0, 0.9914041767624322, 1.9914041767624322],
 [1, 1.0, 0.99190896503980608, 1.9919089650398061],
 [10, 1.0, 0.99519730010384211, 1.9951973001038421],
 [100, 0.99893907162441331, 0.99714434060228452, 1.9960834122266977],
 [1000, 0.99123062269446893, 0.99610591900311518, 1.9873365416975841]]

In [31]:
ridgetest2 = []
for alpha in numpy.arange(10,100,10):
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest2.append([alpha,aa,bb,aa+bb])
ridgetest2

[[10, 1.0, 0.99519730010384211, 1.9951973001038421],
 [20, 0.99999457787882995, 0.9963655244029076, 1.9963601022817374],
 [30, 0.99995662303064048, 0.99666839736933199, 1.9966250203999725],
 [40, 0.99985721747585798, 0.99694242529133492, 1.996799642767193],
 [50, 0.99973612343639584, 0.9970866505134417, 1.9968227739498374],
 [60, 0.99958791879108388, 0.9971299180800739, 1.9967178368711578],
 [70, 0.99945055838811192, 0.99711549555786316, 1.9965660539459751],
 [80, 0.9992734357632268, 0.99714434060228452, 1.9964177763655113],
 [90, 0.99912161637046815, 0.99715876312449525, 1.9962803794949635]]

In [32]:
ridgetest3 = []
for alpha in numpy.arange(20,30,1):
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest3.append([alpha,aa,bb,aa+bb])
ridgetest3

[[20, 0.99999457787882995, 0.9963655244029076, 1.9963601022817374],
 [21, 0.99999096313138347, 0.99643763701396093, 1.9964286001453444],
 [22, 0.99999096313138347, 0.99645205953617166, 1.9964430226675551],
 [23, 0.99998915575766012, 0.99652417214722511, 1.9965133279048852],
 [24, 0.99998734838393677, 0.99652417214722511, 1.9965115205311619],
 [25, 0.99998192626276683, 0.99656743971385719, 1.996549365976624],
 [26, 0.99997469676787354, 0.99659628475827855, 1.9965709815261521],
 [27, 0.99996927464670371, 0.99663955232491042, 1.996608826971614],
 [28, 0.99996385252553366, 0.99665397484712126, 1.9966178273726549],
 [29, 0.99996385252553366, 0.99663955232491053, 1.9966034048504442]]

In [33]:
ridgetest4 = []
for alpha in numpy.arange(20,21,0.1):
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest4.append([alpha,aa,bb,aa+bb])
ridgetest4

[[20.0, 0.99999457787882995, 0.9963655244029076, 1.9963601022817374],
 [20.100000000000001,
  0.99999457787883017,
  0.9963655244029076,
  1.9963601022817379],
 [20.200000000000003,
  0.99999457787882995,
  0.9963655244029076,
  1.9963601022817374],
 [20.300000000000004,
  0.99999457787883017,
  0.9963655244029076,
  1.9963601022817379],
 [20.400000000000006,
  0.9999927705051066,
  0.99637994692511822,
  1.9963727174302248],
 [20.500000000000007,
  0.99999277050510682,
  0.99637994692511822,
  1.996372717430225],
 [20.600000000000009,
  0.99999277050510671,
  0.99640879196953969,
  1.9964015624746465],
 [20.70000000000001,
  0.99999277050510682,
  0.9964232144917502,
  1.996415984996857],
 [20.800000000000011,
  0.99999277050510682,
  0.99642321449175042,
  1.9964159849968572],
 [20.900000000000013,
  0.99999096313138347,
  0.99642321449175031,
  1.9964141776231337]]

In [34]:
sorted(ridgetest4,key=lambda x: x[3])


[[20.0, 0.99999457787882995, 0.9963655244029076, 1.9963601022817374],
 [20.200000000000003,
  0.99999457787882995,
  0.9963655244029076,
  1.9963601022817374],
 [20.100000000000001,
  0.99999457787883017,
  0.9963655244029076,
  1.9963601022817379],
 [20.300000000000004,
  0.99999457787883017,
  0.9963655244029076,
  1.9963601022817379],
 [20.400000000000006,
  0.9999927705051066,
  0.99637994692511822,
  1.9963727174302248],
 [20.500000000000007,
  0.99999277050510682,
  0.99637994692511822,
  1.996372717430225],
 [20.600000000000009,
  0.99999277050510671,
  0.99640879196953969,
  1.9964015624746465],
 [20.900000000000013,
  0.99999096313138347,
  0.99642321449175031,
  1.9964141776231337],
 [20.70000000000001,
  0.99999277050510682,
  0.9964232144917502,
  1.996415984996857],
 [20.800000000000011,
  0.99999277050510682,
  0.99642321449175042,
  1.9964159849968572]]

In [35]:
combined_test = numpy.hstack((test_counts,kaggle_test_features))
combined_train = numpy.hstack((counts,features))

ridge_clf = Ridge(alpha=43.5)
ridge_clf.fit(trainX, trainY)
create_solution(ridge_clf.predict(combined_test))

**Exercise #4.** Print examples on which your classifier makes mistakes (both false positives and false negatives).

This is an important step in understanding what can be done to improve the classifier.
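
One way to approach this exercise is to threshold the continuous scores and collect the messages whose predicted class disagrees with the label. A minimal sketch, assuming a cut-off of 0.5 and message texts aligned with the scores; `find_mistakes` and the demo data are hypothetical:

```python
def find_mistakes(messages, answers, scores, threshold=0.5):
    """Return (false_positives, false_negatives) as lists of (score, message)."""
    false_positives, false_negatives = [], []
    for message, is_spam, score in zip(messages, answers, scores):
        predicted_spam = score > threshold
        if predicted_spam and not is_spam:
            false_positives.append((score, message))  # ham flagged as spam
        elif is_spam and not predicted_spam:
            false_negatives.append((score, message))  # spam that slipped through
    return false_positives, false_negatives

# hypothetical messages, labels, and model scores
demo_messages = ["call me later", "FREE prize, txt now", "win cash", "see you"]
demo_answers = [False, True, True, False]
demo_scores = [0.7, 0.9, 0.2, 0.1]

demo_fp, demo_fn = find_mistakes(demo_messages, demo_answers, demo_scores)
print(demo_fp)
print(demo_fn)
```

To keep texts aligned with `testX` in the notebook, the messages could be passed to the same `train_test_split` call as the feature matrix, since it accepts several arrays at once.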

In [39]:

metrics.f1_score(testY, ridge_clf.predict(testX))

ValueError: Can't handle mix of binary and continuous

In [43]:
ridge_clf.predict("ahdhdk")

TypeError: Cannot cast array data from dtype('float64') to dtype('S32') according to the rule 'safe'
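
The `TypeError` above comes from passing a raw string to `predict`, which expects a numeric matrix with the same columns as the training data. A new message has to go through the fitted vectorizer first. A minimal sketch on a toy corpus (all `toy_*` names and messages are hypothetical); for the model fitted on the combined matrix, the four hand-made features would have to be appended as well:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

# hypothetical toy corpus: 1 = spam, 0 = ham
toy_messages = ["free prize call now", "see you at lunch",
                "call now to claim free prize", "are you coming home"]
toy_answers = [1, 0, 1, 0]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages).toarray()

toy_model = Ridge(alpha=1.0)
toy_model.fit(toy_counts, toy_answers)

def score_message(text):
    """Vectorize one raw string with the fitted vocabulary and score it."""
    row = toy_vectorizer.transform([text]).toarray()
    return float(toy_model.predict(row)[0])

spam_score = score_message("claim your free prize")
ham_score = score_message("see you at home")
print(spam_score > ham_score)
```

Words that the vectorizer never saw during fitting, such as "your" here, are silently ignored by `transform`.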

In [44]:
testX

array([[  0,   0,   0, ...,  63,   0,  17],
       [  0,   0,   0, ...,  28,   0,   5],
       [  0,   0,   0, ...,  26,   0,   6],
       ..., 
       [  0,   0,   0, ...,  59,   0,  20],
       [  0,   0,   0, ..., 101,  15,  36],
       [  0,   0,   0, ...,  49,   1,  14]], dtype=int64)

In [37]:
## Confusion matrix
from sklearn import metrics
# testing score
score = metrics.confusion_matrix(testY, ridge_clf.predict(testX))




ValueError: Can't handle mix of binary and continuous
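
The `ValueError` above arises because `Ridge.predict` returns continuous scores, while `metrics.confusion_matrix` and `metrics.f1_score` expect class labels. Thresholding the scores first avoids it. A minimal sketch on made-up numbers, assuming a cut-off of 0.5 (the value is an assumption and would normally be tuned):

```python
import numpy
from sklearn import metrics

# hypothetical true labels and continuous Ridge-style scores
demo_true = numpy.array([True, False, True, False, False])
demo_scores = numpy.array([0.9, 0.2, 0.4, 0.7, -0.1])

threshold = 0.5  # assumed cut-off
demo_pred = demo_scores > threshold

# rows are true classes (ham, spam), columns are predicted classes
demo_confusion = metrics.confusion_matrix(demo_true.astype(int),
                                          demo_pred.astype(int))
demo_f1 = metrics.f1_score(demo_true.astype(int), demo_pred.astype(int))
print(demo_confusion)
print(demo_f1)
```

In the notebook the same pattern would read `ridge_clf.predict(testX) > threshold` in place of `demo_pred`.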

**Exercise #5. (optional, just for fun)** Write a spam SMS that is not caught by your best model.
Something like "Send sms YES to 091231323 to activate amazing spam filter, FREE for two weeks, then 20p/day. Txt now!".

Use your knowledge of the model's structure.

**Major Goal (not in the homework).** Provide the best classification model for the problem.

You can start by computing new features:
1. Counting occurrences of symbols
2. Ignoring words with digits, dots, etc.
3. Detecting links and phone numbers in the text

Or start by changing the parameters of the classifiers.
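
The feature ideas listed above could be sketched with regular expressions. The patterns below are rough assumptions (a "phone" is taken to be any run of five or more digits, a "link" anything starting with `http://`, `https://` or `www.`), and `extra_features` is a hypothetical helper:

```python
import re

# assumed patterns; real links and phone numbers are messier than this
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_PATTERN = re.compile(r"\b\d{5,}\b")

def extra_features(message):
    """Return [number of links, number of phone-like tokens, number of symbols]."""
    symbols = sum(1 for ch in message
                  if not ch.isalnum() and not ch.isspace())
    return [
        len(URL_PATTERN.findall(message)),
        len(PHONE_PATTERN.findall(message)),
        symbols,
    ]

demo_spam = extra_features("Call 09056242159 or visit www.example.com now!")
demo_ham = extra_features("see you at lunch")
print(demo_spam)
print(demo_ham)
```

Rows like these could be stacked next to the word counts with `numpy.hstack`, as in Exercise #3.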