# Pre-process all the data

Running the code cell below will pre-process all the data and save it to file. You're encouraged to look at the code for `preprocess_and_save_data` in the `helpers.py` file to see what it's doing in detail, but you do not need to change this code.

In [1]:
# Corpus reader:
import numpy as np
import random

import os
root = './Confs_newline/Conf2/'
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader(root, r'.*\.txt', cat_pattern=r'(\w+)/*', encoding='utf-8')

In [2]:
print(reader.categories())
print(reader.fileids())

['kiz', 'kork', 'mutlu', 'notr', 'uzul']
['kiz.txt', 'kork.txt', 'mutlu.txt', 'notr.txt', 'uzul.txt']


In [3]:
### First, tokenize Punctuation: 
# create a token dictionary:
punc_dict = {'.': '||PERIOD||', ',': '||COMMA||', '"': '||QUOTATION_MARK||', ';': '||SEMICOLON||',
             '!': '||EXCLAMATION_MARK||', '?': '||QUESTION_MARK||', '(': '||LEFT_PAREN||',
             ')': '||RIGHT_PAREN||', '\n': '||NEW_LINE||', '-': '||DASH||'}

In [4]:
def sent_tokenize_whole_tweets(text):  # raw file content --> list of tweets, one per line
    for key, token in punc_dict.items():
        text = text.replace(key, ' {} '.format(token))

    sentences= []
    for line in text.split('||NEW_LINE||'):
        line= line.strip()
        sentences.append(line)
    return sentences

In [5]:
all_text=[]
labels= []

In [6]:
for label,file_name in zip(reader.categories(), reader.fileids()):
    sentences= sent_tokenize_whole_tweets(reader.raw(file_name)) # --> this should return a list of contents
    labels.extend([label for i in sentences])
    all_text.extend([i.lower() for i in sentences])
print(len(labels))
print(len(all_text))
# Now, we have all tweets in all_text list!

3317
3317


## Transforming Text into Numbers

In [7]:
word_counts={}
for i in range(len(all_text)):
    for word in all_text[i].split(" "):
        word_counts[word] = word_counts.get(word,0) +1

vocab = set(word_counts.keys())
vocab_size = len(vocab)
print("Number of unique words: {} ".format(vocab_size))

sorted_word_counts= sorted(word_counts, key= word_counts.get, reverse= True)

int_to_vocab= {ii: word for ii,word in enumerate(sorted_word_counts)}
vocab_to_int= {word: ii for ii, word in int_to_vocab.items()}


Number of unique words: 3704 


In [8]:
all_text[2]

'hayır çok sev dizi yani aşırı kız'

In [9]:
int_to_vocab[0]

'çok'

In [10]:
word_counts['çok']

696

# Reducing Noise in the Input Data

## Subsampling

Words that show up often, such as "the", "of", and "for", don't provide much context to the nearby words. If we discard some of them, we can remove some of the noise from our data and in return get faster training and better representations. Mikolov et al. call this process subsampling. For each word $w_i$ in the training set, we'll discard it with probability given by

$$ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} $$

where $t$ is a threshold parameter and $f(w_i)$ is the frequency of word $w_i$ in the total dataset.

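For example, with $t = 1 \times 10^{-5}$, a word that accounts for $1 \times 10^{6}$ of the $16 \times 10^{6}$ words in a dataset is discarded with probability

$$ P(0) = 1 - \sqrt{\frac{1 \times 10^{-5}}{(1 \times 10^{6}) / (16 \times 10^{6})}} \approx 0.98735 $$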
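The arithmetic in the worked equation can be checked with a few lines of Python. This is only a minimal sketch: `drop_probability` and `example_frequency` are names made up for this illustration and are not part of the notebook or of `helpers.py`.

```python
import math

def drop_probability(frequency, threshold):
    """Probability of discarding a word with the given relative frequency."""
    return 1 - math.sqrt(threshold / frequency)

# Relative frequency from the worked equation: 1e6 occurrences out of 16e6 words.
example_frequency = 1e6 / 16e6

p = drop_probability(example_frequency, threshold=1e-5)
print(round(p, 5))  # 0.98735

# A word rarer than the threshold gets a negative value, so it is never discarded.
rare_p = drop_probability(1e-6, threshold=1e-5)
print(rare_p < 0)  # True
```

Because the value goes negative for words whose frequency is below $t$, only the frequent words are ever at risk of being dropped.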

I'm going to leave this up to you as an exercise. Check out my solution to see how I did it.

> **Exercise:** Implement subsampling for the words in `int_words`. That is, go through `int_words` and discard each word with the probability $P(w_i)$ shown above. Note that $P(w_i)$ is the probability that a word is discarded. Assign the subsampled data to `train_words`.

In [11]:
vocab_to_int['her']

47

In [12]:
int_words = [vocab_to_int[word] for word in vocab]
print(int_words[:5])

[6, 1690, 1785, 2597, 2884]


In [13]:
threshold = 1e-3

word_counts_intwords = {vocab_to_int[word]:count for word,count in word_counts.items()}
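# NOTE: in the formula above, f(w_i) is relative to the total number of words,
# i.e. sum(word_counts.values()). This cell divides by the vocabulary size instead,
# and the outputs shown below reflect that choice.
total_count = vocab_size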
freqs = {word: count/total_count for word, count in word_counts_intwords.items()}
p_drop = {word: 1 - np.sqrt(threshold/freqs[word]) for word in word_counts_intwords}
# discard some frequent words, according to the subsampling equation
# create a new list of words for training

train_words = [word for word in word_counts_intwords if random.random() < (1 - p_drop[word])]
print(train_words[:30])

[2020, 1178, 1456, 150, 20, 17, 768, 263, 398, 291, 1179, 689, 2021, 230, 282, 106, 1180, 1457, 636, 166, 1181, 2022, 116, 578, 2023, 2024, 1006, 340, 277, 1458]
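The cell above decides once per vocabulary entry whether a word is kept. For comparison, here is a hedged sketch of subsampling applied to every token occurrence, with $f(w_i)$ computed against the total token count as in the formula. The function `subsample`, the `seed` parameter, and the toy token list are assumptions made for this illustration; this is not the solution the exercise refers to.

```python
import math
import random
from collections import Counter

def subsample(tokens, threshold=1e-3, seed=None):
    """Drop each token occurrence with probability 1 - sqrt(threshold / f(w))."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)  # total number of tokens, not the vocabulary size
    p_drop = {word: 1 - math.sqrt(threshold / (count / total))
              for word, count in counts.items()}
    # Keep a token when the random draw is at or above its drop probability.
    return [word for word in tokens if rng.random() >= p_drop[word]]

# Toy data: one very frequent word, one moderately frequent word, one rare word.
toy_tokens = ["çok"] * 900 + ["kız"] * 99 + ["dizi"]
kept = subsample(toy_tokens, threshold=1e-3, seed=0)
print(Counter(kept))
```

In this toy list the rare word has a drop probability of zero and always survives, while most occurrences of the frequent word are removed.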


In [14]:
len(train_words)

3260

In [15]:
vocab_size


3704

In [16]:
# while training the network, do not feed the words not in this list!
# clean the tweets from those words:

In [17]:
all_text[0]

'çok kız ||exclamation_mark||  ne kadar ayıp şey'

In [18]:
train_word_set = set(train_words)  # a set makes each membership test O(1)
for i in range(len(all_text)):
    line_split= all_text[i].split(" ")
    line_split_reduced= [word for word in line_split if vocab_to_int[word] in train_word_set]
    all_text[i]= ' '.join(line_split_reduced)

In [19]:
# Now, delete empty tweets:

In [20]:
len(all_text)

3317

In [27]:
all_text_modified= [line for line in all_text if len(line)>0]
labels_modified= [labels[i] for i in range(len(all_text)) if len(all_text[i])>0]
print(len(all_text_modified))
print(len(labels_modified))


3144
3144


# Encoding the words
The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our tweets into integers so they can be passed into the network.

> **Exercise:** Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers start at 1, not 0. Also, convert the tweets to integers and store them in a new list called `tweets_ints`.
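The encoding step can be tried on a tiny example first. The sketch below is illustrative only: `build_vocab`, `sample_tweets`, and the other `sample_` names are hypothetical and chosen so they do not clash with the notebook's own variables.

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    sorted_words = sorted(counts, key=counts.get, reverse=True)
    # Start at 1 so that 0 stays free for padding.
    return {word: idx for idx, word in enumerate(sorted_words, 1)}

sample_tweets = ["çok kız", "çok sev dizi"]
sample_vocab_to_int = build_vocab(sample_tweets)
sample_ints = [[sample_vocab_to_int[word] for word in tweet.split()]
               for tweet in sample_tweets]

print(sample_vocab_to_int)
print(sample_ints)
```

The most frequent word receives index 1, and no word is ever mapped to 0.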


In [25]:
word_counts={}
for i in range(len(all_text_modified)):
    for word in all_text_modified[i].split(" "):
        word_counts[word] = word_counts.get(word,0) +1

vocab = set(word_counts.keys())
vocab_size = len(vocab)
print("Number of unique words: {} ".format(vocab_size))

sorted_word_counts= sorted(word_counts, key= word_counts.get, reverse= True)

int_to_vocab= {ii: word for ii,word in enumerate(sorted_word_counts, 1)} #start from 1.
vocab_to_int= {word: ii for ii, word in int_to_vocab.items()}


Number of unique words: 3260 


In [29]:
## use the dict to convert each tweet in all_text_modified to integers
## store the encoded tweets in tweets_ints
tweets_ints = []
for tweet in all_text_modified:
    tweets_ints.append([vocab_to_int[word] for word in tweet.split()])

In [35]:
tweets_ints[:5]

[[1577],
 [735, 1013, 25, 2, 1, 25, 365],
 [68],
 [133, 82, 736, 307, 1578, 54],
 [77, 15]]

# Encoding the labels

Our labels are the five category names printed above: 'kiz', 'kork', 'mutlu', 'notr', and 'uzul'. To use these labels in our network, we need to convert them to integers.

> **Exercise:** Convert the labels to integers and place those in a new array, `encoded_labels`.

In [36]:
# labels_modified is already a list with one label per remaining tweet, so no split is needed.
# Assumed mapping: each category gets its index in the sorted list of categories.
label_to_int = {label: idx for idx, label in enumerate(sorted(set(labels_modified)))}

encoded_labels = np.array([label_to_int[label] for label in labels_modified])

# Padding sequences

To deal with both short and very long tweets, we'll pad or truncate all our tweets to a specific length. For tweets shorter than some `seq_length`, we'll pad with 0s. For tweets longer than `seq_length`, we can truncate them to the first `seq_length` words. A good `seq_length`, in this case, is 200.

> **Exercise:** Define a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network.
>
> * The data should come from `tweets_ints`, since we want to feed integers to the network.
> * Each row should be `seq_length` elements long.
> * For tweets shorter than `seq_length` words, left pad with 0s. That is, if the tweet is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 0, 117, 18, 128]`.
> * For tweets longer than `seq_length`, use only the first `seq_length` words as the feature vector.

As a small example, if `seq_length=10` and an input tweet is:

[117, 18, 128]

The resultant, padded sequence should be:

[0, 0, 0, 0, 0, 0, 0, 117, 18, 128]

Your final `features` array should be a 2D array, with as many rows as there are tweets, and as many columns as the specified `seq_length`.

This isn't trivial and there are a bunch of ways to do this. But, if you're going to be building your own deep learning networks, you're going to have to get used to preparing your data.
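One possible way to implement the padding described above is sketched here. It is a minimal version under stated assumptions, not the reference solution: the function name `pad_features` and the variable `example` are chosen for this illustration, and the input is assumed to be a list of integer lists.

```python
import numpy as np

def pad_features(sequences, seq_length):
    """Left-pad with 0s, or truncate to the first seq_length items."""
    features = np.zeros((len(sequences), seq_length), dtype=int)
    for i, row in enumerate(sequences):
        if len(row) == 0:
            continue  # an empty sequence stays all zeros
        trimmed = row[:seq_length]
        # Write the values into the right-hand end of the row.
        features[i, -len(trimmed):] = trimmed
    return features

example = pad_features([[117, 18, 128], list(range(1, 13))], seq_length=10)
print(example)
```

The first row reproduces the small example above, and the second row shows a 12-item sequence cut down to its first 10 items.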