# Text Processing

In this exercise we will learn how to perform text classification. As an example, we will perform sentiment analysis on movie reviews. The folder *../data/text_polarity/* contains two files: one with positive reviews and one with negative reviews.

First we will import some packages that we need for our task and here we go!

In [None]:
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

### What do we have to do first?
Get in contact with the data. What does it look like? Do we need to delete something? Take a look at the data and think about the polarity: do you agree with the sentiment?


In [None]:
# Open the positive reviews and print them one by one.
with open("../data/text_polarity/positive_pl.txt", "r") as pos_data:
    for line in pos_data:
        print(line.strip())

### What's next?
Prepare the data. 

**How?** 

Can you find out how many reviews we have in each file?
The idea is that you read both files, merge them into one list, and create labels for them.



***Hint:***
If you read a file using `readlines`, it returns a list with one string per line.


In [None]:
with open("../data/text_polarity/positive_pl.txt", "r") as f:
    pos_data = f.readlines()
with open("../data/text_polarity/negative_pl.txt", "r") as f:
    neg_data = f.readlines()

print(len(pos_data))
print(len(neg_data))

data = pos_data + neg_data

Now let's create the labels. You might remember that we already imported pandas and numpy. 

***Hint 1:***
Remember to use numerical values as labels, for example 1 for positive and 0 for negative, or vice versa.

***Hint 2:***
The numpy function `full` creates an array of a given shape and fills it with a given value. Create one array per class and concatenate them into one.

In [None]:
np.full?

In [None]:
pos_labels = np.full(len(pos_data), 1)
neg_labels = np.full(len(neg_data), 0)

labels = np.concatenate((pos_labels, neg_labels))

labels[0]
print("Labels", len(labels))
print("Data", len(data))
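Since pandas is already imported, one optional way to keep texts and labels together is a small DataFrame. The following is only a sketch on made-up reviews; the names `toy_pos`, `toy_neg` and `toy_df` are hypothetical and not part of the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the review lists read from the two files.
toy_pos = ["a wonderful film\n", "great acting\n"]
toy_neg = ["a dull plot\n", "terrible pacing\n", "not worth it\n"]

# One label per review: 1 for positive, 0 for negative.
toy_labels = np.concatenate((np.full(len(toy_pos), 1), np.full(len(toy_neg), 0)))

# Keep each review next to its label so they cannot get out of sync.
toy_df = pd.DataFrame({
    "text": [line.strip() for line in toy_pos + toy_neg],
    "label": toy_labels,
})

# Count how many examples each class has.
class_counts = toy_df["label"].value_counts().to_dict()
print(class_counts)
```

Counting the labels like this is a quick way to check whether the corpus is balanced.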

Now, let's split the data into train and test sets and shuffle it, so that our model does not learn from the order of the data. scikit-learn has a function that performs this split for you.

You can check the documentation of `train_test_split`.

In [None]:
train_test_split?

In [None]:
# You will need to store your data in X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.25, random_state=1234, shuffle=True)
print(X_train[15])
print(y_train[15])
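As an optional variation, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both splits. The sketch below uses a made-up corpus of 16 placeholder reviews (the `toy_*` names are hypothetical) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 8 positive and 8 negative "reviews".
toy_data = [f"review {i}" for i in range(16)]
toy_labels = np.concatenate((np.full(8, 1), np.full(8, 0)))

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_data,
    toy_labels,
    test_size=0.25,
    random_state=1234,
    shuffle=True,
    stratify=toy_labels,  # keep the class ratio equal in both splits
)

print(len(X_tr), len(X_te))    # sizes of the two splits
print(y_tr.sum(), y_te.sum())  # number of positive labels in each split
```

With a balanced corpus and a random shuffle the plain split is usually close to balanced anyway, so stratification matters most for small or skewed data sets.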


Models normally do not process raw text but a numerical version of it, so we need to vectorize our data. There are several ways of doing this, as you might remember from our previous presentation. Here are a couple of them:

**Bag of words**: 
Let's imagine that we have four documents:

 1. I like to eat apples
 2. I like to eat fruits
 3. I do not like apples
 4. I like to cook every day
 
How would you list all the words that occur in our documents?
You can check the string method `split` and the built-in function `sorted`.

In [None]:
str.split?

In [None]:
sorted?

In [None]:
documents = "I like to eat apples " \
            "I like to eat fruits " \
            "I do not like apples " \
            "I like to cook every day"

words = documents.split(" ")
word_list = list(set(words))
print(sorted(word_list))

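With this list, we can now manually write the bag of words for each document. Each position corresponds to one word of the sorted list, with 1 if the word occurs in the document and 0 otherwise: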

1. [1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
2. [1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
3. [1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0]
4. [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1]

And tada!! These are the vectors that we would feed into our model.
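The same vectors can also be built programmatically. This is a minimal sketch, assuming the four short documents from the code cell above and a binary bag of words (presence or absence rather than counts); the names `vocabulary` and `bag_of_words` are only illustrative.

```python
documents = [
    "I like to eat apples",
    "I like to eat fruits",
    "I do not like apples",
    "I like to cook every day",
]

# Vocabulary: every distinct word, sorted (uppercase "I" sorts before lowercase).
vocabulary = sorted({word for doc in documents for word in doc.split()})

# Binary bag of words: 1 if the vocabulary word occurs in the document, else 0.
bag_of_words = [
    [1 if word in doc.split() else 0 for word in vocabulary]
    for doc in documents
]

print(vocabulary)
for vector in bag_of_words:
    print(vector)
```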

**Tf-Idf**:
Another way of vectorizing the data is to calculate the Tf-Idf weight of each word $w$ in each document $d$ of the corpus $D$.

- Tf: Term frequency: $\frac{freq(w, d)}{\# words \in d}$
- Idf: Inverse document frequency: $\log_{10}\frac{|D|}{\# d \in D : w \in d}$

**Examples** (for the first document, "I like to eat apples"):

- Tf-Idf(I) = $\frac{1}{5} \cdot \log_{10}(\frac{4}{4}) = 0$
- Tf-Idf(apples) = $\frac{1}{5} \cdot \log_{10}(\frac{4}{2}) = 0.2 \cdot 0.301 = 0.0602$
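The two examples can be checked with a few lines of Python. This is a sketch of the plain formulas above with a base-10 logarithm, again assuming the four short documents from the bag-of-words code cell; the helper names `tf`, `idf` and `tf_idf` are made up for illustration.

```python
import math

documents = [
    "I like to eat apples",
    "I like to eat fruits",
    "I do not like apples",
    "I like to cook every day",
]
tokenized = [doc.split() for doc in documents]

def tf(word, tokens):
    """Term frequency: occurrences of word divided by document length."""
    return tokens.count(word) / len(tokens)

def idf(word, corpus):
    """Inverse document frequency with a base-10 logarithm.

    Assumes the word occurs in at least one document of the corpus.
    """
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log10(len(corpus) / containing)

def tf_idf(word, tokens, corpus):
    return tf(word, tokens) * idf(word, corpus)

print(tf_idf("I", tokenized[0], tokenized))
print(round(tf_idf("apples", tokenized[0], tokenized), 4))
```

A word that appears in every document, such as "I", gets a weight of zero because it does not help to tell the documents apart.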


However, we do not want to calculate this by hand for every word in our corpus. To vectorize our data we can instead use the `TfidfVectorizer` class from scikit-learn. Note that scikit-learn uses a smoothed variant of the formula, so its values differ from the hand-computed examples. The resulting matrix is the input used to train and test our model.

In [None]:
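# Learn the vocabulary and idf weights from the training reviews only.
vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=5)
trainMatrixTfidf = vectorizer.fit_transform(X_train)
trainInput = trainMatrixTfidf
# Train a linear support vector classifier on the Tf-Idf vectors.
clf = svm.LinearSVC()
clf.fit(trainInput, y_train)
# Vectorize the test reviews with the fitted vocabulary and predict their labels.
testMatrixTfidf = vectorizer.transform(X_test)
test = clf.predict(testMatrixTfidf)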
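The vectorizer and the classifier can also be chained with scikit-learn's `make_pipeline`, which makes it harder to fit the vectorizer on test data by accident. The sketch below uses four made-up reviews (the `toy_*` names are hypothetical) and leaves `min_df` at its default, since `min_df=5` would remove every term from such a tiny corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy reviews; 1 = positive, 0 = negative.
toy_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "a dull and boring plot",
    "terrible pacing and bad acting",
]
toy_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer on the training text only and
# reuses the learned vocabulary whenever predict() is called.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(toy_texts, toy_labels)

predictions = model.predict(["a wonderful story", "a boring film"])
print(predictions)
```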

Our last step is to calculate the accuracy, which is a meaningful metric in this case because we already know that this corpus is balanced.

In [None]:
correct_answers = np.sum(np.equal(test, y_test))
accuracy = correct_answers / (len(test)*1.0) * 100

print(accuracy)
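An equivalent and slightly shorter way to get the same number is to take the mean of the element-wise comparison. A small sketch on made-up predictions (the `toy_*` arrays are hypothetical):

```python
import numpy as np

# Hypothetical predictions and true labels.
toy_predictions = np.array([1, 0, 1, 1])
toy_truth = np.array([1, 0, 0, 1])

# The mean of a boolean array is the fraction of True values.
toy_accuracy = np.mean(toy_predictions == toy_truth) * 100
print(toy_accuracy)
```

Three of the four toy predictions match, so the accuracy is 75 percent.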

If you still have extra time, try following the same steps with our tweets data. Data: *../data/text_twitter/testdata.csv*

**Note:** This data contains three classes: positive, negative and neutral. You can extract the positive and negative tweets and work with them.

Here is the data format:
* The polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
* The id of the tweet (2087)
* The date of the tweet (Sat May 16 23:58:44 UTC 2009)
* The query (lyx). If there is no query, then this value is NO_QUERY.
* The user that tweeted (robotickilldozr)
* The text of the tweet (Lyx is cool)

Also make sure that you read the file correctly. It is a CSV file, which can easily be read using the function _read_csv_ from pandas. You can compare both ways of reading the file and note the advantages of each.

In [None]:
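tweet_file = "../data/text_twitter/testdata.csv"
# Option 1: read the raw lines as a list of strings.
with open(tweet_file, "r") as f:
    tweets_train = f.readlines()
# Option 2: let pandas parse the comma-separated columns.
train_data = pd.read_csv(tweet_file, header=None, delimiter=",")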

In [None]:
display(tweets_train)
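To see the difference between the two ways of reading, here is a small sketch on a made-up two-line sample in the same six-column layout. The sample content, the quoting style and the column names are assumptions for illustration; the real file may differ.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same six-column layout.
sample = (
    '"4","1","Sat May 16 23:58:44 UTC 2009","lyx","user_a","Lyx is cool, really"\n'
    '"0","2","Sat May 16 23:59:01 UTC 2009","NO_QUERY","user_b","I hate mondays"\n'
)

# Plain line reading: splitting on commas breaks the quoted tweet text apart.
raw_lines = sample.splitlines()
naive_fields = raw_lines[0].split(",")

# pandas respects the quotes and returns one value per column.
columns = ["polarity", "id", "date", "query", "user", "text"]  # assumed names
frame = pd.read_csv(io.StringIO(sample), header=None, names=columns)

print(len(naive_fields))  # more than six pieces
print(frame.shape)
print(frame.loc[0, "text"])
```

Reading raw lines is handy for a quick look at the file, while `read_csv` takes care of quoting and gives direct access to each column.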

Now it's time to extract only the text and the labels. You might also consider deleting punctuation and other noise that might mislead your model.

**Hint:** Check the _iterrows_ method in pandas to iterate over each row of the data frame.

In [None]:
labels = []
corpus = []
for index, row in train_data.iterrows():
    if row[0] == 0 or row[0] == 4:
        labels.append(row[0])
        cleanText = re.sub(r"[^A-Za-z0-9]", " ", row[5])
        cleanText = re.sub(r"\s{2,}", " ", cleanText)
        corpus.append(cleanText)
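The same filtering and cleaning can be done without an explicit loop by using pandas column operations. The sketch below runs on three made-up rows (all `toy_*` names are hypothetical) and additionally maps the labels 4 and 0 to 1 and 0, which is an assumption about what a later model might prefer rather than a requirement.

```python
import pandas as pd

# Hypothetical rows in the six-column layout, with integer column labels
# as produced by read_csv(..., header=None).
toy_tweets = pd.DataFrame([
    [4, 1, "date", "lyx", "user_a", "Lyx is cool!!!"],
    [2, 2, "date", "NO_QUERY", "user_b", "meh..."],
    [0, 3, "date", "NO_QUERY", "user_c", "I hate #mondays :("],
])

# Keep only negative (0) and positive (4) tweets.
polar = toy_tweets[toy_tweets[0].isin([0, 4])]

# Same two substitutions as in the loop, applied to the whole column.
clean = (
    polar[5]
    .str.replace(r"[^A-Za-z0-9]", " ", regex=True)
    .str.replace(r"\s{2,}", " ", regex=True)
    .str.strip()  # extra step, not in the loop: drop leading/trailing spaces
)

toy_corpus = clean.tolist()
toy_targets = (polar[0] == 4).astype(int).tolist()  # map 4 -> 1, 0 -> 0
print(toy_corpus, toy_targets)
```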


Now it is your turn to feed this data into a model again and check its performance :)