## Hi, sklearn!

## Reading spam collection

In [1]:
!head -n 40 ./data/1-sms-spam-train.txt

ham	Did he say how fantastic I am by any chance, or anything need a bigger life lift as losing the will 2 live, do you think I would be the first person 2 die from N V Q? 
ham	Black shirt n blue jeans... I thk i c ü...
ham	If e timing can, then i go w u lor...
ham	They r giving a second chance to rahul dengra.
ham	I cant pick the phone right now. Pls send a message
ham	Haha good to hear, I'm officially paid and on the market for an 8th
ham	Ffffffffff. Alright no way I can meet up with you sooner?
ham	But i'm really really broke oh. No amount is too small even  &lt;#&gt; 
ham	Only 2% students solved this CAT question in 'xam... 5+3+2= &lt;#&gt;  9+2+4= &lt;#&gt;  8+6+3= &lt;#&gt;  then 7+2+5=????? Tell me the answer if u r brilliant...1thing.i got d answr.
spam	<Forwarded from 21870000>Hi - this is your Mailbox Messaging SMS alert. You have 4 messages. You have 21 matches. Please call back on 09056242159 to retrieve your messages and matches
ham	No da:)he is stupid da..always sending l

In [2]:
import codecs

with codecs.open('./data/1-sms-spam-train.txt') as f:
    labels, messages = zip(*[line.split('\t') for line in f.readlines()])
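
Each line holds a label and a message separated by a tab, and `readlines()` keeps the trailing newline on every message. A minimal sketch of a stricter parser, assuming the same `label<TAB>message` layout; `parse_sms_line` is a hypothetical helper and the second sample line is made up for illustration:

```python
def parse_sms_line(line):
    """Split one 'label<TAB>message' line into (label, message)."""
    # Drop the line ending, then split on the first tab only,
    # so a tab inside the message text is preserved.
    label, message = line.rstrip('\r\n').split('\t', 1)
    return label, message

sample_lines = [
    "ham\tIf e timing can, then i go w u lor...\n",
    "spam\tYou have 4 messages.\tCall back now\n",  # made-up line with an inner tab
]
sample_labels, sample_messages = zip(*[parse_sms_line(l) for l in sample_lines])
```

Splitting on the first tab only avoids the unpacking error that `zip(*...)` would otherwise hit on a message containing a tab.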

#### read test dataset

In [3]:
with codecs.open('./data/1-sms-spam-test.txt') as f:
    kaggle_test_messages = f.readlines()

#### prepare solution

In [4]:
import numpy

In [5]:
import pandas
from IPython.display import FileLink

def create_solution(predictions, filename='1-sms-spam-predictions.csv'):
    result = pandas.DataFrame({'Id': numpy.arange(len(predictions)), 'Label': predictions})
    result.to_csv('data/{}'.format(filename), index=False)
    return FileLink('data/{}'.format(filename))

In [6]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection is its replacement
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
def compute_data_expressions(messages):
    features = []
    # length of each string
    features.append(map(len, messages))
    
    # number of letters, digits and whitespace characters (whitespace roughly counts words)
    for pattern in [str.isalpha, str.isdigit, str.isspace]:
        features.append(map(lambda message: sum(map(pattern, message)), messages))
        
    features = numpy.array(features).T
    return features

features = compute_data_expressions(messages)
kaggle_test_features = compute_data_expressions(kaggle_test_messages)

answers = numpy.array(labels) == 'spam' 
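
The cell above relies on Python 2, where `map` returns a list. In Python 3 `map` returns a lazy iterator, so the same code would not give a numeric matrix. A small sketch of the same four features written with generator expressions, which behaves the same in both versions; `char_class_features` is a hypothetical name and the two messages are toy inputs:

```python
def char_class_features(message):
    """Return [length, letters, digits, whitespace characters] for one message."""
    return [
        len(message),
        sum(ch.isalpha() for ch in message),
        sum(ch.isdigit() for ch in message),
        sum(ch.isspace() for ch in message),
    ]

# One row per message, already in the (n_messages, 4) layout, so no transpose is needed.
rows = [char_class_features(m) for m in ["Hi 42", "call 0905 now"]]
```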

In [15]:
features


array([[169, 124,   2,  40],
       [ 45,  26,   0,  11],
       [ 39,  24,   0,  11],
       ..., 
       [ 32,  22,   0,   7],
       [176, 119,  21,  29],
       [ 26,  20,   0,   6]])

In [16]:
from sklearn.neighbors import KNeighborsClassifier
# area under the roc curve
from sklearn.metrics import roc_auc_score
knn_clf = KNeighborsClassifier(n_neighbors=1)
knn_clf.fit(features, answers)
roc_auc_score(answers, knn_clf.predict_proba(features)[:, 1])

0.997237808402064
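
This score is measured on the same messages the classifier was fitted on. With `n_neighbors=1` every training message is its own nearest neighbour (unless another message has identical features), so the value says little about unseen messages. To make the metric itself concrete: the area under the ROC curve equals the share of (spam, ham) pairs in which the spam message gets the higher score. A plain-Python sketch of that definition, with `pairwise_auc` as a hypothetical helper name:

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

toy_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs are ordered correctly
```

Only the ordering of the scores matters, which is why a regressor whose outputs are not probabilities can still be scored this way.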

In [17]:
create_solution(knn_clf.predict_proba(kaggle_test_features)[:, 1])

In [18]:
trainX, testX, trainY, testY = train_test_split(features, answers, random_state=42)

## kNN

In [19]:
knn_clf = KNeighborsClassifier(n_neighbors=1)
knn_clf.fit(trainX, trainY)
print 'test', roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])
# print 'train', roc_auc_score(trainY, knn_clf.predict_proba(trainX)[:, 1])

test 0.935098650052


## Finding optimal number of neighbours:

In [20]:
for n_neighbors in [1, 2, 4, 8, 16, 32, 64]:
    knn_clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_clf.fit(trainX, trainY)
    print n_neighbors, roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])

1 0.935098650052
2 0.953595534787
4 0.967989211953
8 0.968775239414
16 0.976476866274
32 0.981092073382
64 0.974277431637
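
One way to read these numbers: with uniform weights, the spam score of a k-nearest-neighbours classifier is the fraction of the k closest training points labelled spam. For k = 1 that fraction is either 0 or 1, so the ranking used by the AUC is very coarse; larger k gives more distinct score levels. A toy sketch under those assumptions (Euclidean distance, uniform weights); `knn_spam_fraction` and the points are made up for illustration:

```python
def knn_spam_fraction(train_points, train_labels, query, k):
    """Fraction of the k nearest training points (squared Euclidean) labelled spam."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = distances[:k]
    return sum(label for _, label in nearest) / float(k)

toy_points = [(0, 0), (1, 0), (5, 5), (6, 5)]
toy_labels = [0, 0, 1, 1]          # 1 = spam
scores_by_k = {k: knn_spam_fraction(toy_points, toy_labels, (4, 4), k) for k in (1, 3, 4)}
```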


### what happens if the metric is changed?

In [21]:
knn_clf = KNeighborsClassifier(metric='canberra', n_neighbors=20)
knn_clf.fit(trainX, trainY)
print roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])
print roc_auc_score(trainY, knn_clf.predict_proba(trainX)[:, 1])

0.983305930541
0.989802797453
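
A possible reason the metric matters here: the four features live on different scales (message length is in the tens or hundreds, the digit count is often 0), and the Canberra distance divides each coordinate difference by the sum of the two magnitudes, so no single large-valued feature dominates. A minimal sketch of the formula, treating a 0/0 term as 0; `canberra_distance` is a hypothetical helper, not the library routine:

```python
def canberra_distance(u, v):
    """Sum over coordinates of |a - b| / (|a| + |b|), skipping 0/0 terms."""
    total = 0.0
    for a, b in zip(u, v):
        denominator = abs(a) + abs(b)
        if denominator > 0:
            total += abs(a - b) / float(denominator)
    return total

# Two feature rows taken from the `features` array printed earlier.
d = canberra_distance([169, 124, 2, 40], [45, 26, 0, 11])
```

Each coordinate contributes at most 1, so the distance between two four-feature rows is bounded by 4.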


## Bag of words

In [22]:
vectorizer = CountVectorizer()
vectorizer.fit(messages)
counts = vectorizer.transform(messages).toarray()
test_counts = vectorizer.transform(kaggle_test_messages).toarray()

In [23]:
vectorizer.fit_transform(messages)
vocab = vectorizer.get_feature_names()
#print vocab
print counts.shape

(3000L, 6294L)
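
To make the `(3000, 6294)` shape concrete: each row is a message, each column a vocabulary word, and each cell the number of times the word occurs in the message. A plain-Python sketch of that idea, assuming the behaviour `CountVectorizer` has by default as far as I know it (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); `bag_of_words` and the two messages are illustrative only:

```python
import re
from collections import Counter

# Tokens of at least two word characters, mirroring the assumed default pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b", re.UNICODE)

def bag_of_words(documents):
    """Return (sorted vocabulary, one count row per document)."""
    tokenized = [TOKEN_RE.findall(doc.lower()) for doc in documents]
    vocabulary = sorted(set(tok for toks in tokenized for tok in toks))
    rows = []
    for toks in tokenized:
        freq = Counter(toks)
        rows.append([freq[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, rows = bag_of_words(["Free call now", "call me now now u"])
```

Note that the single-character token `u` is dropped under this pattern, which is relevant for SMS text where `u`, `r` and `2` stand in for whole words.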


In [24]:
# vocabulary_ is a dictionary that maps each word to its column index
# vectorizer.vocabulary_

In [25]:
trainX, testX, trainY, testY = train_test_split(counts, answers, random_state=42)

## Naive Bayes

#### gaussian

In [26]:
from sklearn.naive_bayes import GaussianNB
nb_clf = GaussianNB()
nb_clf.fit(trainX, trainY)
roc_auc_score(testY, nb_clf.predict_proba(testX)[:, 1])

0.88849948078920049

#### multinomial

In [27]:
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()
nb_clf.fit(trainX, trainY)
roc_auc_score(testY, nb_clf.predict_proba(testX)[:, 1])

0.97836621668397372

In [28]:
trainX.shape

(2250L, 6294L)

## Linear regression + Ridge regularization

In [29]:
from sklearn.linear_model import Ridge

In [30]:
ridge_clf = Ridge()
ridge_clf.fit(trainX, trainY)
print roc_auc_score(testY, ridge_clf.predict(testX))
print roc_auc_score(trainY, ridge_clf.predict(trainX))

0.989976347064
1.0


**Exercise #0.** Play with the regularization parameter `alpha` of `Ridge` and see how it affects quality on train and test.
Check the quality of the best model by submitting to Kaggle.


In [27]:
ridgetest = []
for alpha in [0.01, 0.1, 1, 10, 100, 1000]:
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest.append([alpha,aa,bb,aa+bb])
    #print alpha, roc_auc_score(testY, ridge_clf.predict(testX)), roc_auc_score(trainY, ridge_clf.predict(trainX)) 
ridgetest
## alpha = 20

[[0.01, 1.0, 0.98390446521287633, 1.9839044652128763],
 [0.1, 1.0, 0.98647167416637827, 1.9864716741663782],
 [1, 1.0, 0.9899763470635744, 1.9899763470635743],
 [10, 1.0, 0.99443290642667592, 1.9944329064266759],
 [100, 0.99706121032588768, 0.9932935271720319, 1.9903547374979196],
 [1000, 0.9762800272551958, 0.97902965270566522, 1.9553096799608611]]
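
The train AUC stays at 1.0 for every `alpha` up to 10, so the sum in the last column is driven almost entirely by the held-out column. A simpler selection rule is to pick the row with the highest held-out AUC. A sketch using the rows printed above, rounded to four decimals; the variable names are hypothetical:

```python
# (alpha, train AUC, held-out AUC), copied from the output above and rounded.
ridge_rows = [
    (0.01, 1.0, 0.9839),
    (0.1, 1.0, 0.9865),
    (1, 1.0, 0.9900),
    (10, 1.0, 0.9944),
    (100, 0.9971, 0.9933),
    (1000, 0.9763, 0.9790),
]

# Select on the held-out score only; the train score is kept for reference.
best_alpha, best_train_auc, best_test_auc = max(ridge_rows, key=lambda row: row[2])
```

Choosing `alpha` repeatedly on the same held-out split can overfit that split, so cross-validation would be a more robust variant of the same idea.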

In [28]:
ridgetest2 = []
for alpha in numpy.arange(20,30,1):
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest2.append([alpha,aa,bb,aa+bb])
    #print alpha, roc_auc_score(testY, ridge_clf.predict(testX)), roc_auc_score(trainY, ridge_clf.predict(trainX)) 
ridgetest2

[[20, 0.99995300828319378, 0.99537037037037046, 1.9953233786535642],
 [21, 0.99994035666713055, 0.9953992154147917, 1.9953395720819223],
 [22, 0.99992409030362073, 0.99541363793700255, 1.9953377282406233],
 [23, 0.99990240181894086, 0.99545690550363453, 1.9953593073225755],
 [24, 0.9998879428291545, 0.99547132802584515, 1.9953592708549996],
 [25, 0.99987348383936792, 0.99547132802584515, 1.9953448118652131],
 [26, 0.99985721747585798, 0.99548575054805588, 1.9953429680239139],
 [27, 0.99984275848607151, 0.99547132802584515, 1.9953140865119168],
 [28, 0.99983733636490157, 0.9955001730702665, 1.995337509435168],
 [29, 0.99981745525394505, 0.99551459559247724, 1.9953320508464223]]

In [29]:
ridge_clf = Ridge(alpha=24)
ridge_clf.fit(trainX, trainY)
create_solution(ridge_clf.predict(test_counts))

**Exercise #1.** Let's write the correspondence between columns and words (done below). Which words are most popular?

In [30]:
dictionary = numpy.empty(len(vectorizer.vocabulary_), dtype='O')
for word, index in vectorizer.vocabulary_.iteritems():
    dictionary[index] = word
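
The loop above inverts the word-to-column mapping. `dict.iteritems()` exists only in Python 2; a version-independent sketch of the same inversion uses `.items()`. The helper name and the three-word vocabulary are made up for illustration:

```python
def invert_vocabulary(vocabulary):
    """Turn a {word: column index} mapping into a list of words ordered by column."""
    words = [None] * len(vocabulary)
    for word, index in vocabulary.items():
        words[index] = word
    return words

column_words = invert_vocabulary({'free': 1, 'call': 0, 'now': 2})
```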


In [31]:
# count how many times each word occurs in the training dataset:
word_counts = counts.sum(axis=0)
print word_counts
# print only the words that occurred more than 100 times
print dictionary[word_counts > 100]

[ 4 15  1 ...,  1  1  1]
[u'all' u'am' u'and' u'are' u'at' u'be' u'but' u'by' u'call' u'can' u'come'
 u'day' u'do' u'for' u'free' u'from' u'get' u'go' u'good' u'got' u'gt'
 u'have' u'he' u'how' u'if' u'in' u'is' u'it' u'its' u'just' u'know'
 u'like' u'll' u'love' u'lt' u'me' u'my' u'no' u'not' u'now' u'of' u'ok'
 u'on' u'only' u'or' u'out' u'send' u'so' u'text' u'that' u'the' u'then'
 u'there' u'this' u'time' u'to' u'up' u'ur' u'want' u'was' u'we' u'what'
 u'when' u'will' u'with' u'you' u'your']


In [32]:
words_ordered_by_occurences = dictionary[numpy.argsort(word_counts)]
print words_ordered_by_occurences[-50:]

[u'free' u'day' u'good' u'out' u'll' u'go' u'ok' u'from' u'what' u'up'
 u'all' u'when' u'how' u'this' u'gt' u'lt' u'no' u'with' u'or' u'ur' u'get'
 u'just' u'will' u'be' u'we' u'if' u'at' u'but' u'not' u'do' u'so' u'can'
 u'are' u'now' u'on' u'call' u'that' u'of' u'have' u'for' u'your' u'it'
 u'my' u'me' u'in' u'is' u'and' u'the' u'you' u'to']


In [33]:
dist = numpy.sum(counts, axis=0)
#vocab = vectorizer.get_feature_names()
word_frq = []
for tag, count in zip(dictionary, dist):
    word_frq.append([count, tag])
    # print count, tag
word_frq.sort(reverse=True)
word_frq

[[1176, u'to'],
 [1156, u'you'],
 [715, u'the'],
 [510, u'and'],
 [483, u'is'],
 [470, u'in'],
 [427, u'me'],
 [415, u'my'],
 [391, u'it'],
 [373, u'your'],
 [368, u'for'],
 [322, u'have'],
 [321, u'of'],
 [306, u'that'],
 [296, u'call'],
 [286, u'on'],
 [278, u'now'],
 [263, u'are'],
 [260, u'can'],
 [245, u'so'],
 [229, u'not'],
 [229, u'do'],
 [227, u'but'],
 [223, u'at'],
 [220, u'if'],
 [218, u'we'],
 [215, u'be'],
 [207, u'will'],
 [206, u'just'],
 [206, u'get'],
 [205, u'ur'],
 [203, u'with'],
 [203, u'or'],
 [202, u'no'],
 [186, u'lt'],
 [185, u'gt'],
 [179, u'this'],
 [168, u'how'],
 [167, u'when'],
 [164, u'all'],
 [160, u'up'],
 [159, u'what'],
 [158, u'ok'],
 [158, u'from'],
 [157, u'go'],
 [155, u'll'],
 [149, u'out'],
 [147, u'good'],
 [143, u'day'],
 [140, u'free'],
 [136, u'come'],
 [134, u'like'],
 [134, u'know'],
 [130, u'there'],
 [126, u'its'],
 [125, u'time'],
 [125, u'then'],
 [118, u'got'],
 [116, u'was'],
 [115, u'am'],
 [109, u'only'],
 [105, u'send'],
 [104, u

**Exercise #2.** By analyzing the coefficients in `ridge_clf.coef_`, determine which words have the highest impact on the decision (i.e. have the largest absolute value in `coef_`).

In [34]:
par_impact=zip(dictionary, ridge_clf.coef_)
par_impact_new=sorted(par_impact, key=lambda x: abs(x[1]), reverse=True)
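
Because the target is `True` for spam, a positive coefficient pushes a message towards spam and a negative one towards ham. Sorting by absolute value mixes the two, so it can help to report them separately. A sketch with made-up coefficients; `strongest_words` is a hypothetical helper:

```python
def strongest_words(words, coefficients, top=3):
    """Return (spam indicators, ham indicators), each ordered by |coefficient|."""
    ranked = sorted(zip(words, coefficients), key=lambda pair: abs(pair[1]), reverse=True)
    spam_words = [word for word, coef in ranked if coef > 0][:top]
    ham_words = [word for word, coef in ranked if coef < 0][:top]
    return spam_words, ham_words

# Illustrative values only, not taken from the fitted model.
spam_words, ham_words = strongest_words(
    ['txt', 'ok', 'the', 'claim', 'lor'],
    [0.15, -0.08, 0.01, 0.13, -0.12],
)
```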

In [35]:
par_impact_new

[(u'txt', 0.15285770050925185),
 (u'call', 0.14383880743023994),
 (u'uk', 0.13937018195586354),
 (u'claim', 0.13383817103372986),
 (u'150p', 0.12353058171563081),
 (u'text', 0.12295954893969727),
 (u'service', 0.12164892024913812),
 (u'www', 0.11721507473858522),
 (u'mobile', 0.10871430758407745),
 (u'50', 0.10644871323134385),
 (u'win', 0.10459491583878276),
 (u'reply', 0.10128167797023323),
 (u'won', 0.093737417953707075),
 (u'com', 0.092384004362587854),
 (u'chat', 0.091588166275559629),
 (u'free', 0.090999968625093852),
 (u'customer', 0.089053324721986321),
 (u'ringtone', 0.088830139016501122),
 (u'urgent', 0.083949646478080189),
 (u'stop', 0.083931502140508249),
 (u'88066', 0.079688296076504433),
 (u'awarded', 0.079500482005874562),
 (u'18', 0.079494881727091779),
 (u'prize', 0.07859191579270057),
 (u'or', 0.078184184673863011),
 (u'camera', 0.077228174334375246),
 (u'http', 0.076033905206554037),
 (u'new', 0.075394553927519981),
 (u'ltd', 0.073522346401953959),
 (u'from', 0.07336

**Exercise #3.** Does combining `features` and `counts` improve quality? Use `numpy.hstack` to concatenate the arrays.
Explain the result.

In [36]:
counts.shape, features.shape, numpy.hstack((counts,features)).shape

combined = numpy.hstack((counts,features))
trainX, testX, trainY, testY = train_test_split(combined, answers, random_state=42)

ridgetest = []
for alpha in [0.01, 0.1, 1, 10, 100, 1000]:
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest.append([alpha,aa,bb,aa+bb])
ridgetest

[[0.01, 1.0, 0.9905388254297911, 1.9905388254297911],
 [0.1, 1.0, 0.9914041767624322, 1.9914041767624322],
 [1, 1.0, 0.99190896503980608, 1.9919089650398061],
 [10, 1.0, 0.99519730010384211, 1.9951973001038421],
 [100, 0.99893907162441331, 0.99714434060228452, 1.9960834122266977],
 [1000, 0.99123062269446893, 0.99610591900311518, 1.9873365416975841]]

In [135]:
ridgetest2 = []
for alpha in numpy.arange(10,100,10):
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest2.append([alpha,aa,bb,aa+bb])
ridgetest2

[[10, 1.0, 0.99920676127841235, 1.9992067612784123],
 [20, 0.99999277050510671, 0.99927887388946568, 1.9992716443945724],
 [30, 0.99995662303064037, 0.99925002884504444, 1.9992066518756848],
 [40, 0.99988252070798445, 0.99922118380062308, 1.9991037045086075],
 [50, 0.99977588565830877, 0.99910580362293755, 1.9988816892812462],
 [60, 0.99964575475022999, 0.99906253605630546, 1.9987082908065354],
 [70, 0.99949032061002474, 0.99901926848967348, 1.9985095890996982],
 [80, 0.99932584960120296, 0.99899042344525202, 1.998316273046455],
 [90, 0.99917764495589112, 0.99899042344525213, 1.9981680684011431]]

In [136]:
ridgetest3 = []
for alpha in numpy.arange(10,20,1):
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest3.append([alpha,aa,bb,aa+bb])
ridgetest3

[[10, 1.0, 0.99920676127841235, 1.9992067612784123],
 [11, 1.0, 0.99927887388946579, 1.9992788738894658],
 [12, 1.0, 0.99927887388946579, 1.9992788738894658],
 [13, 1.0, 0.99929329641167652, 1.9992932964116765],
 [14, 1.0, 0.99930771893388715, 1.999307718933887],
 [15, 0.99999999999999989, 0.99929329641167641, 1.9992932964116763],
 [16, 0.99999999999999989, 0.99929329641167652, 1.9992932964116763],
 [17, 0.99999638525255341, 0.99929329641167652, 1.99928968166423],
 [18, 0.99999638525255341, 0.99929329641167641, 1.9992896816642298],
 [19, 0.99999457787883017, 0.99927887388946579, 1.999273451768296]]

In [141]:
ridgetest4 = []
for alpha in numpy.arange(14,15,0.1):
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest4.append([alpha,aa,bb,aa+bb])
ridgetest4

[[14.0, 1.0, 0.99930771893388715, 1.999307718933887],
 [14.1, 1.0, 0.99930771893388715, 1.999307718933887],
 [14.199999999999999, 1.0, 0.99929329641167652, 1.9992932964116765],
 [14.299999999999999, 1.0, 0.99929329641167652, 1.9992932964116765],
 [14.399999999999999, 1.0, 0.99929329641167652, 1.9992932964116765],
 [14.499999999999998, 1.0, 0.99929329641167641, 1.9992932964116763],
 [14.599999999999998, 1.0, 0.99929329641167641, 1.9992932964116763],
 [14.699999999999998, 1.0, 0.99929329641167641, 1.9992932964116763],
 [14.799999999999997, 1.0, 0.99929329641167641, 1.9992932964116763],
 [14.899999999999997, 1.0, 0.99929329641167641, 1.9992932964116763]]

In [43]:
sorted(ridgetest4,key=lambda x: x[3])


[[43.300000000000004,
  0.99981745525394516,
  0.99701453790238836,
  1.9968319931563334],
 [43.200000000000003,
  0.9998192626276684,
  0.99701453790238836,
  1.9968338005300568],
 [43.900000000000013,
  0.99980841838532841,
  0.9970289604245991,
  1.9968373788099276],
 [43.0, 0.99982287737511499, 0.99701453790238836, 1.9968374152775032],
 [43.100000000000001,
  0.99982287737511499,
  0.99701453790238836,
  1.9968374152775032],
 [43.400000000000006,
  0.99981564788022181,
  0.99702896042459899,
  1.9968446083048208],
 [43.800000000000011,
  0.99981022575905176,
  0.99704338294680983,
  1.9968536087058615],
 [43.600000000000009,
  0.99981384050649846,
  0.99704338294680972,
  1.9968572234533082],
 [43.70000000000001,
  0.99981384050649846,
  0.99704338294680983,
  1.9968572234533082],
 [43.500000000000007,
  0.99981564788022181,
  0.99704338294680972,
  1.9968590308270315]]

In [140]:
# the Kaggle test set needs the same columns as `combined`: word counts plus the simple features
combined_test = numpy.hstack((test_counts, kaggle_test_features))

ridge_clf = Ridge(alpha=14)
ridge_clf.fit(trainX, trainY)
create_solution(ridge_clf.predict(combined_test))

In [None]:
## trying to improve

In [129]:
vectorizer = CountVectorizer(token_pattern='\\b\\w+\\b')  
vectorizer.fit(messages)
counts = vectorizer.transform(messages).toarray()
test_counts = vectorizer.transform(kaggle_test_messages).toarray()
vectorizer.fit_transform(messages)
vocab = vectorizer.get_feature_names()
combined = numpy.hstack((counts,features))
trainX, testX, trainY, testY = train_test_split(combined, answers, random_state=42)


In [134]:
ridgetest = []
for alpha in [0.01, 0.1, 1, 10, 100, 1000]:
    ridge_clf = Ridge(alpha)
    ridge_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, ridge_clf.predict(trainX))
    bb = roc_auc_score(testY, ridge_clf.predict(testX))
    ridgetest.append([alpha,aa,bb,aa+bb])
ridgetest

[[0.01, 1.0, 0.9959040036921657, 1.9959040036921656],
 [0.1, 1.0, 0.99750490365755162, 1.9975049036575516],
 [1, 1.0, 0.99854332525672096, 1.9985433252567208],
 [10, 1.0, 0.99920676127841235, 1.9992067612784123],
 [100, 0.99901317394706923, 0.9989760009230414, 1.9979891748701106],
 [1000, 0.99203851874879123, 0.99663955232491064, 1.9886780710737018]]

In [130]:
words_avg=[]
for i in range(len(messages)):
    words = messages[i].split()
    average = sum(len(word) for word in words)/float(len(words))
    words_avg.append([average])
#words_avg_arr=numpy.asarray(words_avg)
combined2 = numpy.hstack((combined, words_avg))
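
The average word length is needed for both the training and the Kaggle test messages, so one option is to wrap it in a function and call it on each collection. A sketch that also guards against a message with no words, which would otherwise divide by zero; `average_word_length` is a hypothetical name and the messages are toy inputs:

```python
def average_word_length(message):
    """Mean length of the whitespace-separated words; 0.0 for an empty message."""
    words = message.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / float(len(words))

# One single-element row per message, ready to be stacked as an extra column.
extra_column = [[average_word_length(m)] for m in ["Free call now", "   "]]
```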

In [138]:

words_avg=[]
for i in range(len(kaggle_test_messages)):
    words = kaggle_test_messages[i].split()
    average = sum(len(word) for word in words)/float(len(words))
    words_avg.append([average])
#words_avg_arr=numpy.asarray(words_avg)
#combined2 = numpy.hstack((test_counts, words_avg))

In [139]:

combined_test = numpy.hstack((test_counts,kaggle_test_features,words_avg))

In [133]:

trainX, testX, trainY, testY = train_test_split(combined2, answers, random_state=42)

In [146]:
from sklearn.naive_bayes import MultinomialNB
nb_test = []
for alpha in numpy.arange(0.001, 0.01, 0.001):
    nb_clf = MultinomialNB(alpha=alpha)
    nb_clf.fit(trainX, trainY)
    aa = roc_auc_score(trainY, nb_clf.predict_proba(trainX)[:,1])
    bb = roc_auc_score(testY, nb_clf.predict_proba(testX)[:,1])
    nb_test.append([alpha,aa,bb,aa+bb])
nb_test


[[0.001, 0.99784109208749849, 0.98329871928002766, 1.9811398113675263],
 [0.002, 0.99772903491665288, 0.98324102919118506, 1.9809700641078378],
 [0.0030000000000000001,
  0.99765945102830533,
  0.98270018460828434,
  1.9803596356365896],
 [0.0040000000000000001,
  0.9975952892611275,
  0.98252711434175621,
  1.9801224036028837],
 [0.0050000000000000001,
  0.99754468279687469,
  0.98226029768085843,
  1.9798049804777331],
 [0.0060000000000000001,
  0.99748865421145183,
  0.9825703819083883,
  1.9800590361198402],
 [0.0070000000000000001,
  0.99744798830267722,
  0.98227472020306916,
  1.9797227085057463],
 [0.0080000000000000002,
  0.99740370764645603,
  0.98249105803622938,
  1.9798947656826855],
 [0.0090000000000000011,
  0.99736213805081964,
  0.98608226606668969,
  1.9834444041175092]]
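
The `alpha` swept here is the additive smoothing parameter of multinomial Naive Bayes: every word count in a class is increased by `alpha` before the per-class word probabilities are computed, so a word never seen in a class still gets a non-zero probability. A sketch of that estimate for one class, with made-up counts; `smoothed_word_probabilities` is a hypothetical helper:

```python
def smoothed_word_probabilities(word_counts, alpha):
    """(count + alpha) / (total + alpha * vocabulary size) for each word."""
    total = float(sum(word_counts)) + alpha * len(word_counts)
    return [(count + alpha) / total for count in word_counts]

laplace = smoothed_word_probabilities([3, 0, 1], alpha=1.0)    # classic add-one smoothing
tiny = smoothed_word_probabilities([3, 0, 1], alpha=0.001)     # almost the raw frequencies
```

With a very small `alpha` an unseen word gets a probability close to zero, so a single such word can dominate the class score.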

**Exercise #4.** Print examples on which your classifier makes mistakes (both false positives and false negatives).

This is an important step in understanding what can be done to improve the classifier.
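
A minimal sketch of how the mistakes could be collected, assuming three parallel sequences: the message texts, the true labels and the classifier scores. `find_mistakes` and the sample data are hypothetical. In this notebook only the feature matrices are passed to `train_test_split`, so the message texts would have to be split alongside them to stay aligned with `testX`.

```python
def find_mistakes(messages, is_spam, scores, threshold=0.5):
    """Return (false positives, false negatives) as lists of message texts."""
    false_positives, false_negatives = [], []
    for message, truth, score in zip(messages, is_spam, scores):
        predicted_spam = score > threshold
        if predicted_spam and not truth:
            false_positives.append(message)    # ham flagged as spam
        elif truth and not predicted_spam:
            false_negatives.append(message)    # spam that slipped through
    return false_positives, false_negatives

# Made-up messages and scores, for illustration only.
fp, fn = find_mistakes(
    ["call me later", "WIN a prize now", "free entry txt now", "ok lor"],
    [False, True, True, False],
    [0.7, 0.9, 0.2, 0.1],
)
```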

In [46]:
#prediction results
pred = ridge_clf.predict(testX)

#change it to binary results
pred_bin = (pred>0.5)



In [47]:
## Confusion matrix
from sklearn import metrics
# testing score
metrics.confusion_matrix(testY, pred_bin)


array([[642,   0],
       [ 13,  95]])

In [40]:
(pred<0).sum()

231

## The confusion matrix shows 13 false negatives (spam predicted as ham) and no false positives. Which score should be used to account for the missed spam?
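
In the confusion matrix printed above, rows are the true classes (ham, spam) and columns the predicted ones, so the off-diagonal 13 counts spam messages predicted as ham. Precision and recall separate the two kinds of error; a short sketch computing them from those four counts:

```python
# Counts read off the confusion matrix above.
tn, fp = 642, 0     # true ham:  predicted ham, predicted spam
fn, tp = 13, 95     # true spam: predicted ham, predicted spam

precision = tp / float(tp + fp)   # share of predicted spam that really is spam
recall = tp / float(tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)
```

At the 0.5 threshold the model never flags ham as spam, but it misses 13 of the 108 spam messages, and recall is the score that reflects this.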

**Exercise #5. (optional, just for fun)** Write a spam SMS that is not caught by your best model.
Something like "Send sms YES to 091231323 to activate amazing spam filter, FREE for two weeks, then 20p/day. Txt now!".

Use your knowledge about the structure of the model.

**Major Goal (not in the homework).** Provide the best classification model for the problem.

You can start with computing new features:
1. Counting occurrences of symbols
2. Ignoring words with digits, dots, etc.
3. Detecting links and phone numbers in the text

Or start with changing the parameters of the classifiers.
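
As a starting point for the third feature idea, a rough sketch that counts link-like and phone-like substrings with regular expressions. The patterns are assumptions chosen for illustration (a link starts with `http://`, `https://` or `www.`; a phone or short code is a run of five or more digits), and `contact_features` is a hypothetical helper:

```python
import re

LINK_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
PHONE_RE = re.compile(r"\b\d{5,}\b")   # standalone run of at least five digits

def contact_features(message):
    """Return [number of link-like substrings, number of phone-like digit runs]."""
    return [len(LINK_RE.findall(message)), len(PHONE_RE.findall(message))]

row = contact_features("Please call back on 09056242159 to retrieve your messages and matches")
```

The resulting columns could be appended to the existing feature matrix with `numpy.hstack`, in the same way as the other hand-made features.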