Sometimes your input data is nested, with a structure more complex than a simple table or matrix.

In such cases it is sometimes useful to shift your mental orientation: analyze and extract information from rows (individual records) rather than from columns that are not well defined.
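As a rough illustration of this row-oriented view, the sketch below flattens one nested record into a flat dictionary of features. The function name `row_to_features`, the feature naming scheme, and the `technologies` field are assumptions made for this example; the real records may be shaped differently.

```python
def row_to_features(record):
    """Turn one nested record into a flat {feature_name: value} dict."""
    features = {}

    # Nested dict: encode each categorical value as an indicator feature.
    address = record.get("address", {})
    for key in ("city", "region", "country"):
        if key in address:
            features[f"address.{key}={address[key]}"] = 1

    # List-valued field (hypothetical name): one indicator per element.
    for tech in record.get("technologies", []):
        features[f"technology={tech}"] = 1

    # Free text: here reduced to a simple word count.
    features["description.n_words"] = len(record.get("description", "").split())
    return features


# A small made-up record, loosely modelled on the company data.
record = {
    "address": {"city": "Seattle", "region": "WA", "country": "United States"},
    "technologies": ["python", "sql"],
    "description": "We build and remodel homes",
}
features = row_to_features(record)
```

Each record is handled on its own, so records with missing or extra fields simply produce fewer or more keys. A dictionary like this is the kind of input that `DictVectorizer` expects.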

In [1]:
from utils import css_from_file
css_from_file('style/style.css')

In [None]:
!pip install nltk

In [2]:
import json
import numpy as np
import pprint
from nltk import download, word_tokenize

download('punkt')

[nltk_data] Downloading package punkt to /home/pawel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
with open("data/companies/companies.json") as dataf:
    data = [json.loads(line) for line in dataf]

An example of deeply nested data with various data types:

Exercise:

1. Name the variable types.
2. What do you do with lists and geolocation?
3. What do you do with counts?

In [4]:
pprint.pprint(data[7])

{'address': {'city': 'Seattle',
             'country': 'United States',
             'postalCode': '98134',
             'raw': '624 South Lander St\n'
                    'Suite 28\n'
                    'Seattle,\n'
                    'WA\n'
                    '98134\n'
                    'United States',
             'region': 'WA',
             'street': '624 South Lander St'},
 'description': 'At 36th avenue design|build we are committed to total client '
                'satisfaction. We believe that strong and lasting '
                'relationships built on integrity and trust, earned through '
                'the remodel, is as important as the renovation of your home. '
                'We consider every project an opportunity to participate with '
                'our clients in a unique and artful, design and construction '
                'process. To each of our clients, our commitment remains '
                'consistent: concise communication, integrity, and prid

Exercise
--------------

Write a pipeline to transform company records.
1. Select 3 types of features you want to transform (such as description, list of skills, technologies, address, etc.)
2. Create a pipeline in this format:
```python
make_union(
    make_pipeline(TechnologyFeatures(), DictVectorizer()),
    make_pipeline(AddressFeatures(), DictVectorizer()),
    make_pipeline(ExtractDescription(), CountVectorizer())
)
```
3. Classify industry (like in the previous exercise)
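To make the expected shape of such a pipeline concrete, here is a minimal sketch with two of the three branches. The bodies of `AddressFeatures` and `ExtractDescription` are assumptions for illustration only (the exercise leaves their implementation to you), and the two records are made up.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union


class AddressFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: record -> dict of address fields."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        rows = []
        for record in X:
            address = record.get("address") or {}
            rows.append({k: address[k]
                         for k in ("city", "region", "country")
                         if address.get(k)})
        return rows


class ExtractDescription(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: record -> description string."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [record.get("description") or "" for record in X]


features = make_union(
    make_pipeline(AddressFeatures(), DictVectorizer()),
    make_pipeline(ExtractDescription(), CountVectorizer()),
)

# Two made-up records to check that the branches combine.
records = [
    {"address": {"city": "Seattle", "region": "WA"},
     "description": "home remodel design"},
    {"address": {"city": "Portland", "region": "OR"},
     "description": "software design"},
]
X = features.fit_transform(records)  # sparse matrix, one row per record
```

Each branch receives the full list of records and extracts only the part it cares about; `make_union` then stacks the resulting columns side by side.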

Exercise
===============

1. Write a transformer class called SparsityFilter that accepts a minimum frequency (`min_nnz`), i.e. the minimum number of non-zero entries a column must have to be kept. Watch out for the fit function: this class has state that you must save.

```python
class SparsityFilter(BaseEstimator, TransformerMixin):
    def __init__(self, min_nnz=None):
        self.min_nnz = min_nnz

    def fit(self, X, y=None):
        ???
        return self

    def transform(self, X):
        return ???
```

In [None]:
# write sparsity class here

Double-click to see the solution.

<div class="spoiler">

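from sklearn.base import BaseEstimator, TransformerMixin

class SparsityFilter(BaseEstimator, TransformerMixin):
    def __init__(self, min_nnz=None):
        self.min_nnz = min_nnz

    def fit(self, X, y=None):
        # State: number of non-zero entries in each column of the sparse matrix
        self.sparsity = X.getnnz(0)
        return self

    def transform(self, X):
        # With min_nnz=None there is no threshold, so keep every column
        if self.min_nnz is None:
            return X
        return X[:, self.sparsity >= self.min_nnz]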
</div>
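The column selection that the solution relies on can be checked on a tiny sparse matrix. This is only a sketch: the matrix values and the threshold of 2 are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 3x4 sparse matrix; column 1 is empty, column 3 has a single entry.
X = csr_matrix(np.array([
    [1, 0, 2, 0],
    [0, 0, 3, 0],
    [4, 0, 5, 6],
]))

# Number of stored non-zero entries per column (axis 0).
nnz_per_column = X.getnnz(0)

# Keep only columns with at least 2 non-zero entries.
mask = nnz_per_column >= 2
X_filtered = X[:, mask]
```

The counts are computed once, in `fit`, so that `transform` drops the same columns from the training data and from any later data.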