# Synthesizing Realistic Data
*Useful for testing*

In the process of writing about a pandas function, I realized I needed a realistic dataset to effectively demonstrate its use.

Here's how you can use [Faker](https://faker.readthedocs.io/en/master/index.html) to generate one.

Start by installing Faker with pip:

```bash
pip install faker
```

### Using Faker

```python
from faker import Faker

fake = Faker()

print(f"Hello, my name is {fake.first_name()} {fake.last_name()}.\n"
      f"I'm a {fake.job()} at {fake.company()}.")
```

```
Hello, my name is Darlene Parker.
I'm a Phytotherapist at Martin-Austin.
```


### Extending Faker

The full list of "providers" is available in the [docs](https://faker.readthedocs.io/en/master/providers.html).

I've found that wrapping Faker in a class that mimics a particular data model is efficient and easy to implement.

```python
class FakeShopper:
    """
    Convenience class that simulates a shopper on an e-commerce site.

    Notes
    -----
    We define a set, `USERNAMES`.
    A username should be unique, so we track those generated thus far.

    For user_ids, let's assume they are auto-incremented.
    """

    FAKER = Faker()
    USERNAMES = set()
    LAST_ID = 1

    def __init__(self, active_begin='-30d', active_end='now'):
        self.active_begin = active_begin
        self.active_end = active_end
        self.username = self.get_username()
        self.user_id = self.get_user_id()

    @classmethod
    def _make_unique(cls, func, unique_set):
        # Keep generating until we get a value not seen before, then record it
        unique_set_attr = getattr(cls, unique_set)
        while True:
            result = func(cls.FAKER)
            if result not in unique_set_attr:
                unique_set_attr.add(result)
                return result

    def get_username(self):

        def make_username(f):
            # Use the local part of a fake email address as the username
            return f.email().split("@")[0]

        return self._make_unique(make_username, "USERNAMES")

    @classmethod
    def get_user_id(cls):
        # Hand out the current id, then increment the class-wide counter
        value = cls.LAST_ID
        cls.LAST_ID += 1
        return value

    def timestamp(self):
        return self.FAKER.date_time_between(self.active_begin,
                                            self.active_end)

    def product_id(self):
        # ean() returns a string of digits; int() drops any leading zeros
        return int(self.FAKER.ean())

    def product_action(self):
        return self.FAKER.random_element(['view', 'add_to_cart', 'save', 'share', 'purchase'])

    def activity(self):
        # Simulate activity on an e-commerce site
        return {
            "id": self.user_id,
            "username": self.username,
            "timestamp": self.timestamp(),
            "product_id": self.product_id(),
            "action": self.product_action(),
        }
```
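
One thing to keep in mind about the reject-and-retry loop in `_make_unique`: it never terminates if the generator runs out of new values. Below is a minimal sketch of the same idea with a cap on attempts, using only the standard library. The names `make_unique` and `max_attempts` and the tiny name pool are hypothetical and exist only for illustration.

```python
import random


def make_unique(generate, seen, max_attempts=1000):
    """Call `generate` until it returns a value not in `seen`, then record it."""
    for _ in range(max_attempts):
        value = generate()
        if value not in seen:
            seen.add(value)
            return value
    raise RuntimeError(f"no unique value found after {max_attempts} attempts")


rng = random.Random(0)
pool = ["ana", "ben", "cho"]  # deliberately tiny pool of candidate usernames
seen = set()

# Three draws exhaust the pool; each name is handed out exactly once.
usernames = [make_unique(lambda: rng.choice(pool), seen) for _ in range(3)]
print(sorted(usernames))

# A fourth draw cannot succeed, so the call gives up instead of hanging.
try:
    make_unique(lambda: rng.choice(pool), seen, max_attempts=50)
except RuntimeError as exc:
    print(exc)
```

Recent Faker releases also provide a `unique` proxy (for example `fake.unique.user_name()`) that gives up with an exception when it cannot find a new value; whether that fits depends on the Faker version in use.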

### Making a dataset

Now we can create data!

```python
# Create 100 unique shoppers
shoppers = [FakeShopper() for _ in range(100)]
```

```python
import numpy as np

# We'd like some variation in the level of activity generated by each shopper,
# so we use an exponential distribution to simulate the number of interactions.
# Add 1 so every shopper has at least 1 activity.
n_activities = np.random.exponential(100, 100).astype(int) + 1
```
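
To make the "add 1" step concrete: the call draws 100 values from an exponential distribution with scale (mean) 100, and truncating them to `int` can produce 0. The sketch below repeats the computation with NumPy's `default_rng` so the draw is repeatable; the seed and the variable names `raw` and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability

# 100 draws from an exponential distribution with scale (mean) 100
raw = rng.exponential(scale=100, size=100)

# Truncate to whole interactions, then shift so every shopper has at least one
counts = raw.astype(int) + 1

print(counts.min() >= 1)
print(counts.shape)
```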


```python
import pandas as pd

def generate_data(shoppers, activities):
    # Lazily yield one activity record per interaction for each shopper
    for shopper, n_actions in zip(shoppers, activities):
        for _ in range(n_actions):
            yield shopper.activity()

df = pd.DataFrame(generate_data(shoppers, n_activities))

# Order by timestamp
df = df.sort_values('timestamp').reset_index(drop=True)

df.head(10)
```

| | id | username | timestamp | product_id | action |
|---|---|---|---|---|---|
| 0 | 51 | smithselena | 2020-06-12 00:15:15 | 8167065674545 | view |
| 1 | 70 | newtonmichael | 2020-06-12 00:15:45 | 8209493702992 | purchase |
| 2 | 4 | debra96 | 2020-06-12 00:17:48 | 905645568581 | add_to_cart |
| 3 | 97 | butlerlisa | 2020-06-12 00:18:32 | 7494221770225 | purchase |
| 4 | 60 | dstephens | 2020-06-12 00:21:38 | 1022934969283 | save |
| 5 | 95 | tlowe | 2020-06-12 00:23:42 | 6034683060093 | purchase |
| 6 | 42 | brookschad | 2020-06-12 00:24:35 | 1461324381166 | share |
| 7 | 48 | hadams | 2020-06-12 00:29:27 | 892806748053 | add_to_cart |
| 8 | 97 | butlerlisa | 2020-06-12 00:32:52 | 5201939637205 | share |
| 9 | 55 | grantpeggy | 2020-06-12 00:34:08 | 9064054062729 | view |
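
A few cheap checks can confirm that a generated frame has the properties we built in: each username maps to a single id, and rows are in timestamp order. This is a minimal sketch on three rows copied from the table above; the `check_activity_frame` helper is hypothetical and only for illustration.

```python
import pandas as pd


def check_activity_frame(df):
    """Return True if ids are consistent per username and rows are time-ordered."""
    one_id_per_user = df.groupby("username")["id"].nunique().eq(1).all()
    time_ordered = df["timestamp"].is_monotonic_increasing
    return bool(one_id_per_user and time_ordered)


sample = pd.DataFrame(
    {
        "id": [4, 97, 97],
        "username": ["debra96", "butlerlisa", "butlerlisa"],
        "timestamp": pd.to_datetime(
            ["2020-06-12 00:17:48", "2020-06-12 00:18:32", "2020-06-12 00:32:52"]
        ),
        "action": ["add_to_cart", "purchase", "share"],
    }
)

print(check_activity_frame(sample))
```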


### Finally

We've seen how we can synthesize datasets and embed them in a pandas DataFrame.

This idea can also be extended to SQLAlchemy models using introspection:

```python
from random import choice
from typing import Dict

from faker import Faker
from sqlalchemy import Boolean, Column, DateTime, String, Text, inspect


class FakeModel:
    """
    Base class for generating fake data from a Flask-SQLAlchemy model.
    """
    FAKER = Faker()

    def __init__(self, model):
        self.model = model
        # Map column name -> Column using SQLAlchemy's runtime inspection
        self._columns = dict(inspect(model).columns.items())

    def map_type(self, k: str, v: Column):
        """
        Use introspection to return a callable that returns an appropriate value
        """
        if v.default is not None:
            # Return the default value. This assumes a scalar default;
            # callable or SQL-expression defaults are not handled here.
            return lambda: v.default.arg

        # Below we have to equate `Boolean` to native `bool`, `String` to native `str`, etc
        sqltype = v.type
        clstype = type(sqltype)

        if clstype is Boolean:   # Simple choice
            return lambda: choice([True, False])

        elif clstype is DateTime:  # Customize date range as needed
            return lambda: self.FAKER.date_time_between(start_date='-3d')

        # Strings and Text will require greater care to return realistic values
        # Here we look at the column name to guess what to return

        elif clstype in (String, Text):
            max_len = sqltype.length or 512  # Cap at the declared length, else 512 (MySQL)
            if all(s in k for s in ('user', 'name')):  # User name
                return lambda: self.FAKER.user_name()[:max_len]
            elif 'email' in k:  # Email address
                return lambda: self.FAKER.ascii_email()[:max_len]
            elif k == 'full_name':
                return lambda: self.FAKER.name()[:max_len]
            else:
                return lambda: "_".join(self.FAKER.words())[:max_len]
        else:
            # Unsupported column type: fall back to None
            return lambda: None

    @property
    def columns(self) -> Dict[str, Column]:
        return self._columns

    def generate(self):
        data = {}
        for k, v in self.columns.items():
            if v.primary_key:  # Let the database assign primary keys
                continue
            data[k] = self.map_type(k, v)()
        return data
```
