# Data Preprocessing

The goal of this lab is to introduce you to data preprocessing techniques in order to make your data suitable for applying a learning algorithm.

## 1. Handling Missing Values

A common (and very unfortunate) property of real-world data is the occurrence of missing and erroneous values in multiple features of a dataset.
Download the dataset and corresponding information from the <a href="http://www.cs.uni-potsdam.de/ml/teaching/ss15/ida/uebung02/abalone.csv">course website</a>.

To determine the age of an abalone snail, you have to kill the snail and count the annual
rings. You are told to estimate the age of a snail on the basis of the following attributes:
1. type: male (0), female (1) and infant (2)
2. length in mm
3. width in mm
4. height in mm
5. total weight in grams
6. weight of the meat in grams
7. drained weight in grams
8. weight of the shell in grams
9. number of annual rings (number of rings + 1.5 yields the age)

However, this data is incomplete. Missing values are marked with −1.

In [1]:
import pandas as pd

# load data 
df = pd.read_csv("http://www.cs.uni-potsdam.de/ml/teaching/ss15/ida/uebung02/abalone.csv")
df.columns=['type','length','width','height','total_weight','meat_weight','drained_weight','shell_weight','num_rings']
df.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,-1
1,1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,2,-1.0,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,2,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


### Exercise 1.1

Compute the mean of all positive numbers of each numeric column and the counts of each category.

In [16]:
#Count
count_ = df[df>0].count()
print('\nCount:\n', count_)

Count:
 type              2589
length            4052
width             4052
height            4050
total_weight      4070
meat_weight       4051
drained_weight    4067
shell_weight      4074
num_rings         4077
dtype: int64


In [18]:
#Mean
mean_ = df[df>0].mean()
print('Mean:\n', mean_)

Mean:
 type              1.505987
length            0.523692
width             0.407955
height            0.139679
total_weight      0.828843
meat_weight       0.359263
drained_weight    0.180249
shell_weight      0.238604
num_rings         9.921756
dtype: float64
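
Note that the mask `df > 0` also hides type 0 (male), which is why the count for `type` above is much lower than for the other columns. A minimal sketch of an alternative, assuming that only the sentinel −1 marks a missing value: replace −1 with NaN, then compute means for the numeric columns and counts for the categories. The toy frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same conventions as the abalone data (values are made up).
toy = pd.DataFrame({
    "type":   [0, 1, 2, 2, -1],
    "length": [0.35, 0.53, -1.0, 0.44, 0.425],
    "height": [0.09, -1.0, 0.125, 0.08, 0.095],
})

# Treat only the sentinel -1 as missing, so that type 0 (male) survives.
clean = toy.replace(-1, np.nan)

numeric_cols = ["length", "height"]
means = clean[numeric_cols].mean()          # NaN entries are skipped
type_counts = clean["type"].value_counts()  # NaN entries are not counted

print(means)
print(type_counts)
```

With this masking, the mean is only computed for columns where it is meaningful, and every category (including male) appears in the counts.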


### Exercise 1.2

Compute the median of all positive numbers of each numeric column.

In [17]:
#Median
median_ = df[df>0].median()
print('Median:\n', median_ )

Median:
 type              2.00000
length            0.54500
width             0.42500
height            0.14000
total_weight      0.80175
meat_weight       0.33600
drained_weight    0.17050
shell_weight      0.23350
num_rings         9.00000
dtype: float64


### Exercise 1.3

Handle the missing values in a way that you find suitable. Argue your choices.

Polynomial Interpolation
If a set of data contains n known points with distinct x-values, then there exists exactly one polynomial of degree n-1 or smaller that passes through all of those points. The polynomial's graph can be thought of as "filling in the curve" to account for data between the known points.
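
The uniqueness statement can be checked on a small example. The sketch below uses three made-up points and fits a polynomial of degree 2 with `np.polyfit`; it is only meant to illustrate the idea, not the method pandas uses internally.

```python
import numpy as np

# Three known points (made-up values) with distinct x-coordinates.
x = np.array([0.0, 1.0, 3.0])
y = np.array([1.0, 3.0, 2.0])

# With n = 3 points, a polynomial of degree n - 1 = 2 fits them exactly.
coeffs = np.polyfit(x, y, deg=len(x) - 1)

# The polynomial reproduces the known points ...
fitted = np.polyval(coeffs, x)
# ... and "fills in" a value between them.
y_at_2 = np.polyval(coeffs, 2.0)
print(fitted, y_at_2)
```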


In [43]:
ind_interpolate = df[df>0].interpolate(method='index', limit_direction='backward')
ind_interpolate = ind_interpolate.interpolate(method='index')
ind_interpolate.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,1.0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,9.0
1,1.0,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9.0
2,1.5,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10.0
3,2.0,0.4325,0.255,0.08,0.205,0.0895,0.0395,0.055,7.0
4,2.0,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8.0


In [42]:
lin_interpolate = df[df>0].interpolate(method='linear', limit_direction='backward')
lin_interpolate = lin_interpolate.interpolate(method='linear')
lin_interpolate.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,1.0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,9.0
1,1.0,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9.0
2,1.5,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10.0
3,2.0,0.4325,0.255,0.08,0.205,0.0895,0.0395,0.055,7.0
4,2.0,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8.0


In [49]:
pol_interpolate = df[df>0].interpolate(method='polynomial', order=2, limit_direction='backward')
pol_interpolate = pol_interpolate.interpolate(method='polynomial', order=2)
pol_interpolate.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,
1,1.0,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9.0
2,1.610568,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10.0
3,2.0,0.38718,0.255,0.08,0.205,0.0895,0.0395,0.055,7.0
4,2.0,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8.0
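
Interpolation along the row index treats neighbouring rows as related, and it turns the categorical `type` into fractional values such as 1.5. If the rows are assumed to be independent snails in arbitrary order, a simpler alternative is to fill numeric columns with the median and the categorical column with its most frequent value. The following is a sketch under that assumption, on made-up toy data:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values); -1 marks a missing entry as in the lab data.
toy = pd.DataFrame({
    "type":   [0, 1, -1, 2, 2],
    "length": [0.35, 0.53, 0.44, -1.0, 0.425],
})

clean = toy.replace(-1, np.nan)

# Numeric column: fill with the median of the observed values.
clean["length"] = clean["length"].fillna(clean["length"].median())

# Categorical column: fill with the most frequent category (the mode).
clean["type"] = clean["type"].fillna(clean["type"].mode().iloc[0]).astype(int)

print(clean)
```

The median is less sensitive to outliers than the mean, which is one possible argument for this choice.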


### Exercise 1.4

Perform Z-score normalization on every column (except the type, of course).

In [54]:
from scipy.stats import zscore
zscores_df = df.drop('type', axis=1).apply(zscore)

zscores_df.head()

Unnamed: 0,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,-0.451527,-0.391702,-0.079751,-0.990125,-0.686952,-0.465375,-0.587604,-2.965564
1,0.181217,0.208544,0.147804,-0.187423,-0.19465,-0.036616,0.007003,-0.184349
2,-0.135155,-0.004447,0.097236,-0.473658,-0.323213,-0.1634,-0.226593,0.093773
3,-5.197109,-0.430428,-0.130318,-1.026571,-0.718309,-0.506868,-0.651312,-0.740592
4,-0.187884,-0.256163,-0.054467,-0.766115,-0.556821,-0.331676,-0.375245,-0.46247
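
In the output above, row 3 has a length z-score of about −5.2 because the −1 sentinel was still in the data when the scores were computed. Standardizing after the missing values have been handled avoids this. The z-score of a value x is (x − mean) / standard deviation; a minimal pandas sketch on made-up values, where `zscore_columns` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def zscore_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Standardise each column to mean 0 and standard deviation 1."""
    # ddof=0 uses the population standard deviation, matching scipy.stats.zscore's default.
    return (frame - frame.mean()) / frame.std(ddof=0)

toy = pd.DataFrame({"length": [0.2, 0.4, 0.6], "num_rings": [7.0, 9.0, 11.0]})
z = zscore_columns(toy)
print(z)
```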


## 2. Preprocessing text (Optional)

One possible way to transform text documents into vectors of numeric attributes is to use the TF-IDF representation. We will experiment with this representation using the 20 Newsgroup data set. The data set contains postings on 20 different topics. The classification problem is to decide which of the topics a posting falls into. Here, we will only consider postings about medicine and space.

In [55]:
from sklearn.datasets import fetch_20newsgroups


categories = ['sci.med', 'sci.space']
raw_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print('The index of each category is: {}'.format([(i,target) for i,target in enumerate(raw_data.target_names)]))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


The index of each category is: [(0, 'sci.med'), (1, 'sci.space')]


Check out some of the postings; you might find some funny ones!

In [60]:
import numpy as np
idx = np.random.randint(0, len(raw_data.data))
print ('This is a {} email.\n'.format(raw_data.target_names[raw_data.target[idx]]))
print ('There are {} emails.\n'.format(len(raw_data.data)))
print(raw_data.data[idx])

This is a sci.med email.

There are 1187 emails.

From: geb@cs.pitt.edu (Gordon Banks)
Subject: Re: tuberculosis
Reply-To: geb@cs.pitt.edu (Gordon Banks)
Organization: Univ. of Pittsburgh Computer Science
Lines: 17

In article <206@ky3b.UUCP> km@ky3b.pgh.pa.us (Ken Mitchum) writes:
>
>I found out that tuberculosis appears to be the only MEDICAL (as oppsed to psychiatric)
>condition that one can be committed for, and this is because very specific laws were
>enacted many years ago regarding tb. I am certain these vary from state to state.

I think in Illinois venereal disease (the old ones, not AIDS) was included.
Syphillis was, for sure.




-- 
----------------------------------------------------------------------------
Gordon Banks  N3JXP      | "Skepticism is the chastity of the intellect, and
geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon." 
----------------------------------------------------------------------------



Let's pick the first 10 postings from each category.

In [73]:
idxs_med = np.flatnonzero(raw_data.target == 0)
idxs_space = np.flatnonzero(raw_data.target == 1)
idxs = np.concatenate([idxs_med[:10],idxs_space[:10]])
print(idxs)
data = np.array(raw_data.data)
data = data[idxs]


[ 1  2  3  5 11 12 13 14 15 16  0  4  6  7  8  9 10 18 20 22]


<a href="http://www.nltk.org/">NLTK</a> is a toolkit for natural language processing. Take some time to install it and go through this <a href="http://www.slideshare.net/japerk/nltk-in-20-minutes">short tutorial/presentation</a>.

The punkt package downloaded below is a tokenizer that splits a text into a list of sentences. It uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences.

In [78]:
import nltk
import itertools
nltk.download('punkt')

# Tokenize each posting into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in data]
vocabulary_size = 1000
unknown_token = 'unknown'

[nltk_data] Downloading package punkt to /Users/alyona/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [82]:
# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print ("Found %d unique words tokens." % len(word_freq.items()))

Found 1641 unique words tokens.


In [90]:
# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
 
print ("Using vocabulary size %d." % vocabulary_size)
print ("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

Using vocabulary size 1000.
The least frequent word in our vocabulary is 'REASONS' and appeared 1 times.
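
The vocabulary is typically used to turn each token into an integer index, with every token outside the vocabulary mapped to the unknown token. A small sketch of that step using only the standard library; the documents and the tiny vocabulary size are made up for illustration:

```python
from collections import Counter

# Two tiny tokenized "documents" (made-up tokens).
docs = [["the", "moon", "is", "far"], ["the", "moon", "is", "the", "cause"]]
vocabulary_size = 4          # keep the 3 most frequent tokens plus the unknown token
unknown_token = "unknown"

counts = Counter(tok for doc in docs for tok in doc)
index_to_word = [w for w, _ in counts.most_common(vocabulary_size - 1)]
index_to_word.append(unknown_token)
word_to_index = {w: i for i, w in enumerate(index_to_word)}

# Map every out-of-vocabulary token to the index of the unknown token.
unk = word_to_index[unknown_token]
encoded = [[word_to_index.get(tok, unk) for tok in doc] for doc in docs]
print(encoded)
```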


In [104]:
word_to_index

{':': 0,
 '.': 1,
 ',': 2,
 '--': 3,
 '>': 4,
 'the': 5,
 ')': 6,
 'to': 7,
 '(': 8,
 'of': 9,
 '@': 10,
 'a': 11,
 'and': 12,
 'I': 13,
 'that': 14,
 'is': 15,
 'in': 16,
 'it': 17,
 'be': 18,
 '?': 19,
 'for': 20,
 '!': 21,
 'this': 22,
 "n't": 23,
 'are': 24,
 "'s": 25,
 'From': 26,
 'do': 27,
 'Subject': 28,
 'Organization': 29,
 'Lines': 30,
 "''": 31,
 'on': 32,
 'have': 33,
 'as': 34,
 'not': 35,
 '``': 36,
 'you': 37,
 'In': 38,
 'an': 39,
 'was': 40,
 'we': 41,
 'Re': 42,
 'The': 43,
 '-': 44,
 '<': 45,
 'would': 46,
 'if': 47,
 'o': 48,
 'writes': 49,
 'will': 50,
 'It': 51,
 'but': 52,
 'or': 53,
 'they': 54,
 'Space': 55,
 'article': 56,
 'may': 57,
 'with': 58,
 'food': 59,
 'by': 60,
 'what': 61,
 '...': 62,
 'see': 63,
 'like': 64,
 'should': 65,
 'can': 66,
 'there': 67,
 'some': 68,
 'about': 69,
 'at': 70,
 'know': 71,
 'up': 72,
 'who': 73,
 'Griffin': 74,
 'out': 75,
 'which': 76,
 'Is': 77,
 'diet': 78,
 'one': 79,
 'inflammation': 80,
 'used': 81,
 'stage': 82,
 ...

In [102]:
wor

[':',
 '.',
 ',',
 '--',
 '>',
 'the',
 ')',
 'to',
 '(',
 'of',
 '@',
 'a',
 'and',
 'I',
 'that',
 'is',
 'in',
 'it',
 'be',
 '?',
 'for',
 '!',
 'this',
 "n't",
 'are',
 "'s",
 'From',
 'do',
 'Subject',
 'Organization',
 'Lines',
 "''",
 'on',
 'have',
 'as',
 'not',
 '``',
 'you',
 'In',
 'an',
 'was',
 'we',
 'Re',
 'The',
 '-',
 '<',
 'would',
 'if',
 'o',
 'writes',
 'will',
 'It',
 'but',
 'or',
 'they',
 'Space',
 'article',
 'may',
 'with',
 'food',
 'by',
 'what',
 '...',
 'see',
 'like',
 'should',
 'can',
 'there',
 'some',
 'about',
 'at',
 'know',
 'up',
 'who',
 'Griffin',
 'out',
 'which',
 'Is',
 'diet',
 'one',
 'inflammation',
 'used',
 'stage',
 'any',
 'does',
 'Crohn',
 'has',
 'things',
 'But',
 'my',
 'them',
 'more',
 '3',
 'all',
 '*',
 'space',
 '$',
 'billion',
 'Reply-To',
 'Ken',
 'anyone',
 'their',
 "'d",
 'cause',
 'problems',
 'Steve',
 'because',
 'been',
 'told',
 'doctor',
 'now',
 'had',
 'good',
 'anything',
 'point',
 'when',
 'think',
 'just'

### Exercise 2.1

Code your own TF-IDF representation function and use it on this dataset. (Don't use code from libraries; build your own function with NumPy/Pandas.) Use the formula TFIDF = TF * (IDF + 1). Adding 1 to the IDF ensures that terms with zero IDF, i.e., terms that occur in all documents of the training set, are not entirely ignored.

In [100]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df = pd.DataFrame(countvec.fit_transform(data).toarray(), columns=countvec.get_feature_names_out())
print(df)

    02  041300  07  0815  10  101  10511  11  115397  12  ...  yellow  \
0    0       0   0     0   0    0      0   0       0   0  ...       0   
1    0       0   0     0   0    0      0   0       0   0  ...       0   
2    0       0   0     0   0    0      0   1       0   0  ...       0   
3    0       0   0     0   0    0      0   0       0   0  ...       0   
4    0       0   0     0   0    0      0   0       0   0  ...       0   
5    0       0   0     0   0    0      0   0       0   0  ...       0   
6    0       0   0     0   0    0      0   0       0   0  ...       0   
7    0       0   0     0   1    0      0   0       0   0  ...       0   
8    0       0   0     0   0    0      0   0       0   0  ...       3   
9    0       0   0     0   0    0      0   0       0   0  ...       0   
10   0       0   0     0   0    0      0   0       0   1  ...       0   
11   0       0   0     0   0    0      0   1       0   0  ...       0   
12   0       0   0     1   0    0      0   0       

In [108]:
for ind, row in df.iterrows():
    print(row.iloc[6])  # count of the 7th vocabulary term in this document

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0


In [148]:
for doc in df.iloc[1,:]:
    print(doc)

0
0
0
...


In [160]:
for ind, doc in df.iterrows():
    for ind2,term in enumerate(doc):
        idf_1 = df.iloc[:,ind2].gt(0).sum() + 1

1
1
1
...


In [172]:
len(df)

20

In [168]:
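def tfidf(x):
    # x: one column of term counts (one entry per document)
    # TF * (ln(N / df) + 1), with N documents and df = number of documents containing the term
    return x * (np.log(len(x) / (x > 0).sum()) + 1)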

In [170]:
rep = df.apply(tfidf)

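# Check if your implementation is correct
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False, use_idf=True)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
X_train = pd.DataFrame(vectorizer.fit_transform(data).toarray(), columns=vectorizer.get_feature_names_out())
answer = ['No', 'Yes']
if rep is not None:
    # compare with a floating-point tolerance rather than exact equality
    is_correct = np.allclose(X_train.to_numpy(), rep.to_numpy())
    print('Is this implementation correct?\nAnswer: {}'.format(answer[int(is_correct)]))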

Is this implementation correct?
Answer: Yes
