Topics
- Feature extraction and engineering: transforming raw data into features suitable for modeling;
- Feature transformation: transforming the data to improve the accuracy of the algorithm;
- Feature selection: removing unnecessary features.

In [1]:
# preload dataset automatically, if not already in place.
import os
from pathlib import Path
import numpy as np
import pandas as pd

def download_file_from_gdrive(file_url, filename, out_path: Path, overwrite=False):
    """
    Downloads a file from GDrive given a URL.
    :param file_url: a string formatted as https://drive.google.com/uc?id=<file_id>
    :param filename: the desired file name
    :param out_path: the desired folder where the file will be downloaded to
    :param overwrite: whether to overwrite the file if it already exists
    """
    target = Path(out_path) / filename

    # Download when the file is missing, or when the caller asks to overwrite it.
    if overwrite or not target.exists():
        os.system(f'gdown {file_url} -O {target}')

In [11]:
# preload dataset automatically, if not already in place.
import os

import requests

url = "https://drive.google.com/uc?export=download&id=1_lqydkMrmyNAgG4vU4wVmp6-j7tV0XI8"
file_name = "/Users/aadrijupadya/Downloads/renthop_train.json"


def load_renthop_dataset(url, target, overwrite=False):
    # check if exists already
    if os.path.isfile(target) and not overwrite:
        print("Dataset is already in place")
        return

    print("Will download the dataset from", url)

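    response = requests.get(url)
    response.raise_for_status()  # fail loudly instead of saving an error page

    # Use a context manager so the file handle is closed after writing.
    with open(target, "wb") as f:
        f.write(response.content)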


load_renthop_dataset(url, file_name)

Dataset is already in place


In [9]:
import numpy as np
import pandas as pd

# The downloaded file is plain JSON (it starts with '{"'), so no gzip decompression is needed.
df = pd.read_json(file_name)

In [12]:
# The task for this dataset is to predict the popularity of a new rental listing; the metric is log loss.

In [13]:
# Feature extraction converts raw data into numeric features (e.g. NumPy arrays) suitable for modeling.

For text data, we first tokenize it (split it into tokens such as words, following the conventions of the language) and then normalize the tokens using stemming or lemmatization.
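
As a rough illustration of these two steps, here is a minimal sketch using only the standard library. The helper names `tokenize` and `crude_stem` are hypothetical, and the suffix stripping is deliberately crude: it is a toy stand-in for a real stemmer, not an established algorithm. Lemmatization (for example mapping "were" to "be") needs a dictionary and is usually done with a dedicated NLP library, so it is not shown.

```python
import re


def tokenize(text):
    # Lowercase the text, then keep runs of letters, digits and apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", text.lower())


def crude_stem(token):
    # Toy suffix stripping for illustration only; real stemmers use many more rules.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


tokens = tokenize("The cats were chasing the dogs")
stems = [crude_stem(t) for t in tokens]
print(tokens)
print(stems)
```

Note that a stem such as "chas" does not have to be a real word; it only has to be the same for all related forms.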

The easiest method is Bag of Words: we create a vector whose length equals the size of the vocabulary, count the occurrences of each word in the text, and place each count at that word's position in the vector.

In [14]:
texts = ["i have a cat", "you have a dog", "you and i have a cat and a dog"]

vocabulary = list(
    enumerate(set([word for sentence in texts for word in sentence.split()]))
)
print("Vocabulary:", vocabulary)


def vectorize(text):
    # `text` is a list of tokens; count how often each vocabulary word occurs in it.
    vector = np.zeros(len(vocabulary))
    for i, word in vocabulary:
        vector[i] = text.count(word)
    return vector


print("Vectors:")
for sentence in texts:
    print(vectorize(sentence.split()))

Vocabulary: [(0, 'and'), (1, 'you'), (2, 'a'), (3, 'dog'), (4, 'i'), (5, 'cat'), (6, 'have')]
Vectors:
[0. 0. 1. 0. 1. 1. 1.]
[0. 1. 1. 1. 0. 0. 1.]
[2. 1. 2. 1. 1. 1. 1.]


This is an extremely naive implementation. In practice, you need to consider stop words, the maximum length of the vocabulary, more efficient data structures (usually text data is converted to a sparse vector), etc.
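
A minimal sketch of how those concerns might be handled with scikit-learn's `CountVectorizer`, reusing the same three sentences. The value `max_features=1000` is an arbitrary choice for illustration, and the built-in English stop-word list is only one possible list.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = ["i have a cat", "you have a dog", "you and i have a cat and a dog"]

# stop_words="english" drops common function words;
# max_features caps the vocabulary size (1000 is an arbitrary illustrative value).
vectorizer = CountVectorizer(stop_words="english", max_features=1000)
counts = vectorizer.fit_transform(docs)

print(issparse(counts))        # fit_transform returns a SciPy sparse matrix
print(vectorizer.vocabulary_)  # only the words that survived filtering
print(counts.toarray())        # dense view, for display only
```

With these settings only "cat" and "dog" remain: the other words are either stop words or single characters, which the default token pattern ignores.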

When using algorithms like Bag of Words, we lose the order of the words in the text, which means that the texts “i have no cows” and “no, i have cows” will appear identical after vectorization when, in fact, they have the opposite meaning. To avoid this problem, we can revisit our tokenization step and use N-grams (sequences of N consecutive tokens) instead.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 1))
vect.fit_transform(["no i have cows", "i have no cows"]).toarray()

array([[1, 1, 1],
       [1, 1, 1]])

In [16]:
vect.vocabulary_

{'no': 2, 'have': 1, 'cows': 0}

In [17]:
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit_transform(["no i have cows", "i have no cows"]).toarray()

array([[1, 1, 1, 0, 1, 0, 1],
       [1, 1, 0, 1, 1, 1, 0]])

In [18]:
vect.vocabulary_

{'no': 4,
 'have': 1,
 'cows': 0,
 'no have': 6,
 'have cows': 2,
 'have no': 3,
 'no cows': 5}

With bigrams included, the two sentences now map to different vectors, so vectorization can tell them apart.

In [20]:
from scipy.spatial.distance import euclidean
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(3, 3), analyzer="char_wb")

n1, n2, n3, n4 = vect.fit_transform(
    ["smith", "petersen", "petrov", "smith"]
).toarray()

euclidean(n1, n4), euclidean(n2, n3), euclidean(n3, n4)

(0.0, 3.1622776601683795, 3.3166247903554)

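The code above measures how similar strings are using character n-grams: with analyzer="char_wb", CountVectorizer builds 3-grams from the characters inside word boundaries. Identical strings ("smith" and "smith") are at distance 0, and "petersen" is slightly closer to "petrov" than "petrov" is to "smith". In general, CountVectorizer converts a collection of text documents to a matrix of token counts and produces a sparse representation of the counts using scipy.sparse.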


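TF-IDF (term frequency - inverse document frequency) is a numerical statistic that reflects how important a word is to a document in a collection: the weight grows with how often the word appears in the document and shrinks with how many documents in the collection contain it, so words that are common everywhere carry little weight.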
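
To make the definition concrete, here is a minimal sketch of the plain textbook variant, tf = count / document length and idf = log(N / document frequency), written with the standard library only. The names `doc_freq` and `tf_idf` are hypothetical. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed and normalized variant by default, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "i have a cat".split(),
    "you have a dog".split(),
    "you and i have a cat and a dog".split(),
]

# Document frequency: the number of documents in which each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))


def tf_idf(doc):
    # Term frequency times inverse document frequency for every word in one document.
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(len(docs) / doc_freq[word])
        for word, count in counts.items()
    }


weights = tf_idf(docs[2])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word}: {weight:.3f}")
```

In the third sentence, "have" gets weight 0 because it occurs in every document, while "and" gets the highest weight because it occurs twice here and nowhere else.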

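Word2Vec is another NLP vectorization technique: it learns a dense vector for each word so that words used in similar contexts end up close together in a high-dimensional space. This makes it possible to compare semantic similarity and to do vector arithmetic (king - man + woman ≈ queen).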
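
The analogy arithmetic can be sketched with a few hand-made vectors. These are not learned Word2Vec embeddings: the four dimensions (royalty, male, female, fruit) and all values are made up for illustration, and real embeddings have hundreds of dimensions with no such readable meaning.

```python
import math

# Hand-made toy vectors, NOT trained embeddings.
# Assumed dimensions: royalty, male, female, fruit.
vectors = {
    "king": [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# king - man + woman, computed element by element.
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# Exclude the query words themselves, then pick the most similar remaining word.
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(nearest)
```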

### Images

For images, convolutional neural networks are the most commonly used models.

In [None]:
import keras

