# Data Preprocessing

The goal of this lab is to introduce you to data preprocessing techniques in order to make your data suitable for applying a learning algorithm.

## 1. Handling Missing Values

A common (and very unfortunate) property of real-world data is the occurrence of missing and erroneous values in multiple features of a dataset.
Download the dataset and corresponding information from the <a href="http://www.cs.uni-potsdam.de/ml/teaching/ss15/ida/uebung02/abalone.csv">course website</a>.

To determine the age of an abalone snail, you have to kill the snail and count the annual
rings. You are told to estimate the age of a snail on the basis of the following attributes:
1. type: male (0), female (1) and infant (2)
2. length in mm
3. width in mm
4. height in mm
5. total weight in grams
6. weight of the meat in grams
7. drained weight in grams
8. weight of the shell in grams
9. number of annual rings (number of rings + 1.5 yields the age)

However, this data is incomplete. Missing values are marked with −1.
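Because −1 is an ordinary number to pandas, it silently distorts statistics such as the mean. One possible way to make the missing entries explicit is to convert the sentinel to `NaN`, which pandas skips by default. The sketch below uses a small made-up frame (`toy`, not the abalone data) purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the abalone data (values are made up for illustration).
toy = pd.DataFrame({
    "length": [0.35, -1.0, 0.44],
    "num_rings": [-1, 9, 10],
})

# Replace the -1 sentinel with NaN so pandas skips those entries in statistics.
toy_nan = toy.replace(-1, np.nan)

# Count the missing entries per column.
missing_per_column = toy_nan.isna().sum()
print(missing_per_column)

# Column means now ignore the missing entries.
print(toy_nan.mean())
```

This only works because −1 is not a legitimate value for any attribute in the list above; with a sentinel that could also be real data, a different marker would be needed.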

In [1]:
import pandas as pd

# load data 
df = pd.read_csv("http://www.cs.uni-potsdam.de/ml/teaching/ss15/ida/uebung02/abalone.csv")
df.columns=['type','length','width','height','total_weight','meat_weight','drained_weight','shell_weight','num_rings']
df.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,-1
1,1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,2,-1.0,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,2,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


### Exercise 1.1

Compute the mean of all positive numbers of each numeric column and the counts of each category.

In [12]:
##################
#INSERT CODE HERE#
##################
for column in df.columns:
    if column != 'type':
        print("\n\nColumn name: " + column + "   Mean: " + str(df[column][df[column] > 0].mean()))
    



Column name: length   Mean: 0.5236920039486674


Column name: width   Mean: 0.40795533070089013


Column name: height   Mean: 0.13967901234567806


Column name: total_weight   Mean: 0.8288428746928771


Column name: meat_weight   Mean: 0.3592626511972346


Column name: drained_weight   Mean: 0.18024858618146095


Column name: shell_weight   Mean: 0.23860444280805088


Column name: num_rings   Mean: 9.921756193279371
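The exercise also asks for the counts of each category. A minimal sketch of how this could be done with `value_counts()` follows, using a made-up `type` column (`toy_type` is a hypothetical stand-in for `df['type']`). Note that the filter has to be `>= 0` rather than `> 0` here, because male snails are coded as 0.

```python
import pandas as pd

# Toy 'type' column; codes follow the attribute list (0 = male, 1 = female, 2 = infant).
# The -1 entry assumes the missing-value sentinel may also appear in this column.
toy_type = pd.Series([0, 1, 0, 2, 2, -1, 1, 0], name="type")

# Drop the -1 sentinel before counting, then count each remaining category.
type_counts = toy_type[toy_type >= 0].value_counts().sort_index()
print(type_counts)
```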


### Exercise 1.2

Compute the median of all positive numbers of each numeric column.

In [13]:
##################
#INSERT CODE HERE#
##################
for column in df.columns:
    if column != 'type':
        print("\n\nColumn name: " + column + "   Median: " + str(df[column][df[column] > 0].median()))



Column name: length   Median: 0.545


Column name: width   Median: 0.425


Column name: height   Median: 0.14


Column name: total_weight   Median: 0.80175


Column name: meat_weight   Median: 0.336


Column name: drained_weight   Median: 0.1705


Column name: shell_weight   Median: 0.2335


Column name: num_rings   Median: 9.0


### Exercise 1.3

Handle the missing values in a way that you find suitable. Argue your choices.

In [None]:
##################
#INSERT CODE HERE#
##################
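One possible approach, sketched here on a made-up frame rather than the abalone data, is median imputation: mark every −1 as missing and fill it with the median of the valid entries in the same column. The median is less sensitive to outliers than the mean, which is one argument for choosing it. The names `toy`, `numeric_cols` and `imputed` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "type": [0, 1, 2, 2],
    "length": [0.35, 0.53, -1.0, 0.425],
    "num_rings": [-1, 9, 7, 8],
})

numeric_cols = [c for c in toy.columns if c != "type"]

# Mark the sentinel as missing in the numeric columns.
numeric = toy[numeric_cols].replace(-1, np.nan)

# Fill each numeric column with its own median (computed over valid entries only).
imputed = toy.copy()
imputed[numeric_cols] = numeric.fillna(numeric.median())
print(imputed)
```

Whether imputation is suitable depends on the column. For the target `num_rings`, dropping the affected rows may be preferable to inventing labels.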

### Exercise 1.4

Perform Z-score normalization on every column (except `type`, of course!).

In [19]:
##################
#INSERT CODE HERE#
##################
for column in df.columns:
    if column != 'type':
        # Use only valid entries; missing values are marked with -1
        val = df[column][df[column] > 0]
        mean = val.mean()  # the Z-score centers on the mean, not the median
        std = val.std()    # sample standard deviation (pandas default, ddof=1)
        print("\n\nColumn name: " + column + "   \nZ-score: \n" + str((val - mean) / std))
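Instead of printing the scores column by column, the normalized values can also be written back into a frame in one vectorized step. The following is a minimal sketch on made-up data (`toy` and `zscores` are illustrative names), assuming missing values have already been handled.

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "type": [0, 1, 2, 2],
    "length": [0.35, 0.53, 0.44, 0.425],
    "height": [0.09, 0.135, 0.125, 0.095],
})

numeric_cols = [c for c in toy.columns if c != "type"]

# z = (x - mean) / std, computed column-wise; pandas' std() uses the sample
# standard deviation (ddof=1) by default.
zscores = toy.copy()
zscores[numeric_cols] = (toy[numeric_cols] - toy[numeric_cols].mean()) / toy[numeric_cols].std()
print(zscores)
```

After this transformation every numeric column has mean 0 and standard deviation 1, while `type` is left untouched.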

## 2. Preprocessing text (Optional)

One possible way to transform text documents into vectors of numeric attributes is to use the TF-IDF representation. We will experiment with this representation using the 20 Newsgroups data set, which contains postings on 20 different topics. The classification problem is to decide which topic a posting belongs to. Here, we will only consider postings about medicine and space.

In [None]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups

categories = ['sci.med', 'sci.space']
raw_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print('The index of each category is: {}'.format([(i,target) for i,target in enumerate(raw_data.target_names)]))

Check out some of the postings; you might find some funny ones!

In [None]:
idx = np.random.randint(0, len(raw_data.data))
print('This is a {} email.\n'.format(raw_data.target_names[raw_data.target[idx]]))
print('There are {} emails.\n'.format(len(raw_data.data)))
print(raw_data.data[idx])

Let's pick the first 10 postings from each category.

In [None]:
idxs_med = np.flatnonzero(raw_data.target == 0)
idxs_space = np.flatnonzero(raw_data.target == 1)
idxs = np.concatenate([idxs_med[:10],idxs_space[:10]])
data = np.array(raw_data.data)
data = data[idxs]

<a href="http://www.nltk.org/">NLTK</a> is a toolkit for natural language processing. Take some time to install it and go through this <a href="http://www.slideshare.net/japerk/nltk-in-20-minutes">short tutorial/presentation</a>.

The package downloaded below is a tokenizer that divides a text into a list of sentences. It uses an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.

In [None]:
import nltk
import itertools
nltk.download('punkt')

# Tokenize each posting into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in data]
vocabulary_size = 1000
unknown_token = 'unknown'

In [None]:
# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print("Found %d unique word tokens." % len(word_freq.items()))

In [None]:
# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
 
print("Using vocabulary size %d." % vocabulary_size)
print("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))
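To illustrate what the two lookup structures are for, here is a small self-contained sketch that encodes tokens as integer indices and maps out-of-vocabulary words to the unknown token. It uses `collections.Counter` in place of `nltk.FreqDist` (which is a `Counter` subclass) so that it runs without NLTK. All `toy_*` names and the example tokens are made up.

```python
from collections import Counter

# Toy tokenized postings (hypothetical stand-ins for `tokenized_sentences`).
toy_tokens = [["the", "shuttle", "launch"], ["the", "doctor", "the", "launch"]]
toy_vocabulary_size = 3
toy_unknown_token = "unknown"

# Keep the (vocabulary_size - 1) most frequent words and reserve one slot
# for the unknown token.
counts = Counter(tok for doc in toy_tokens for tok in doc)
toy_index_to_word = [w for w, _ in counts.most_common(toy_vocabulary_size - 1)]
toy_index_to_word.append(toy_unknown_token)
toy_word_to_index = {w: i for i, w in enumerate(toy_index_to_word)}

# Map every token to its index; out-of-vocabulary words fall back to the unknown token.
unk = toy_word_to_index[toy_unknown_token]
encoded = [[toy_word_to_index.get(tok, unk) for tok in doc] for doc in toy_tokens]
print(encoded)  # [[0, 2, 1], [0, 2, 0, 1]]
```

Here "shuttle" and "doctor" are too rare to make it into the vocabulary, so both are encoded with the index of the unknown token.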

### Exercise 2.1

Code your own TF-IDF representation function and use it on this dataset. (Don't use code from libraries; build your own function with NumPy/pandas.)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
# get_feature_names_out() requires scikit-learn >= 1.0 (older releases: get_feature_names())
df = pd.DataFrame(countvec.fit_transform(data).toarray(), columns=countvec.get_feature_names_out())

def tfidf(df):
    rep = None  # placeholder until the exercise is solved
    
    ##################
    #INSERT CODE HERE#
    ##################
    
    return rep

rep = tfidf(df)

# Check if your implementation is correct
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False, use_idf=True)
X_train = pd.DataFrame(vectorizer.fit_transform(data).toarray(), columns=vectorizer.get_feature_names_out())
answer=['No','Yes']
if rep is not None:
    print('Is this implementation correct?\nAnswer: {}'.format(answer[1*np.all(X_train == rep)]))
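As a point of reference for the check above: with `norm=None` and `smooth_idf=False`, scikit-learn weights raw term counts by idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) the number of documents containing term t. The following is a hedged sketch of that weighting on a made-up count matrix. `tfidf_sketch` and `toy_counts` are hypothetical names, and this is only one way to write it, not the expected solution.

```python
import numpy as np
import pandas as pd

def tfidf_sketch(counts):
    """TF-IDF with raw term counts and idf(t) = ln(n / df(t)) + 1, without normalization."""
    n_docs = counts.shape[0]
    # Document frequency: number of documents in which each term occurs at least once.
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log(n_docs / doc_freq) + 1.0
    # Multiply every row (document) by the per-term idf weights.
    return counts * idf

# Toy count matrix: rows are documents, columns are terms (values are made up).
toy_counts = pd.DataFrame(
    [[2, 0, 1],
     [1, 1, 0]],
    columns=["moon", "nurse", "orbit"],
)
toy_rep = tfidf_sketch(toy_counts)
print(toy_rep)
```

The sketch assumes every term occurs in at least one document, which holds for a vocabulary built by `CountVectorizer` on the same data. A term that appears in all documents (like "moon" here) gets an idf of exactly 1, so its counts pass through unchanged.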