# Create data

This notebook creates the data used in the examples.

Most tables come in two versions: one that processes without problems in the examples and one that contains issues, so the difference can be seen. There are also some Excel outputs for the scripts example.

The specific sections for creating tables are:
+ [Conversions](#Conversions), converting column dtypes
+ [Altering](#Altering), changing the values in the DataFrame, adding new columns, dropping rows or columns, etc.
+ [Checks](#Checks), looking for outliers or rows where the data does not follow the prescribed rules
+ [For summary tables](#For-summary-tables), a single table used for a summary output
+ [For scripts](#For-scripts), the data and header tables written to Excel for the scripts example

## Setup
<hr>

Imports and display settings.

In [1]:
import sqlite3
import pickle
import datetime

import pandas as pd
import numpy as np

In [2]:
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 10)

## Create tables
<hr>

The examples use many small tables, each created below.

### Conversions
<hr>

In [3]:
df_convert = pd.DataFrame(
    [
        ('A', '1', '0.6', '2019-01-01'),
        ('B', '4', '5.2', '2019-02-05'),
        ('C', '1', '5.6', '2018-12-17'),
        ('D', '10', '15.9', '2019-07-18'),
        ('E', '-8', '4.7', '2018-03-09')
    ],
    columns=['object', 'int', 'float', 'date']
)

In [4]:
df_convert_issues = pd.DataFrame(
    [
        ('A', '1', '0.6', '2019-02-29'),
        ('B', '4.5', 'A', '2019-22-05'),
        ('C', '1', '5.6', '2018-12-17'),
        ('D', 'b', '15.9', '2019-09-31'),
        (5, '-8', '4.7', '2018-03-09')
    ],
    columns=['object', 'int', 'float', 'date']
)
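
To make the difference between the two conversion tables concrete, here is a minimal sketch that coerces the `int`, `float` and `date` values from `df_convert_issues` with `errors='coerce'`, so anything that cannot be converted becomes `NaN` or `NaT`. The name `df_demo` and the whole-number test are illustrative assumptions; the examples themselves may convert the columns differently.

```python
import pandas as pd

# Copy of the int/float/date values from df_convert_issues (df_demo is a hypothetical name)
df_demo = pd.DataFrame(
    [
        ('1', '0.6', '2019-02-29'),
        ('4.5', 'A', '2019-22-05'),
        ('1', '5.6', '2018-12-17'),
        ('b', '15.9', '2019-09-31'),
        ('-8', '4.7', '2018-03-09')
    ],
    columns=['int', 'float', 'date']
)

# errors='coerce' turns unconvertible values into NaN / NaT instead of raising
int_coerced = pd.to_numeric(df_demo['int'], errors='coerce')
float_coerced = pd.to_numeric(df_demo['float'], errors='coerce')
date_coerced = pd.to_datetime(df_demo['date'], format='%Y-%m-%d', errors='coerce')

# '4.5' parses as a number but is not a whole number, so flag it separately
not_whole = int_coerced.notna() & (int_coerced % 1 != 0)

# Rows where at least one column could not be converted cleanly
failed = int_coerced.isna() | not_whole | float_coerced.isna() | date_coerced.isna()
print(df_demo[failed])
```

Only the rows with `'2018-12-17'` and `'2018-03-09'` pass all three conversions; the other dates are not valid calendar dates.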

### Altering
<hr>

In [5]:
df_alterations = pd.DataFrame(
    [
        ('A', 2, 'key_1'),
        ('B', 199, 'key_2'),
        ('C', -1, 'key_1'),
        ('D', 20, 'key_3'),
        ('E', 6, 'key_2')
    ],
    columns=['to_map', 'add_1', 'merge_key']
)

In [6]:
df_alterations_issues = pd.DataFrame(
    [
        ('A', 2, 'key_1'),
        ('B', 199, 2),
        ('C', -1, 'key_1'),
        (['D'], 'a', 'key_3'),
        ('E', 6, 'key_2')
    ],
    columns=['to_map', 'add_1', 'merge_key']
)
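
The problem values in `df_alterations_issues` are values of an unexpected type: a list in `to_map`, a string in `add_1` and an integer in `merge_key`. One way to locate them is sketched below. The expected types are an assumption inferred from `df_alterations`, and `df_demo`, `expected_types` and `wrong_type` are hypothetical names.

```python
import numbers

import pandas as pd

# Copy of df_alterations_issues (df_demo is a hypothetical name)
df_demo = pd.DataFrame(
    [
        ('A', 2, 'key_1'),
        ('B', 199, 2),
        ('C', -1, 'key_1'),
        (['D'], 'a', 'key_3'),
        ('E', 6, 'key_2')
    ],
    columns=['to_map', 'add_1', 'merge_key']
)

# Assumed type for each column, inferred from the clean df_alterations table
expected_types = {'to_map': str, 'add_1': numbers.Integral, 'merge_key': str}

# One boolean column per source column: True where the value has an unexpected type
wrong_type = pd.DataFrame({
    col: ~df_demo[col].map(lambda value, t=expected: isinstance(value, t)).astype(bool)
    for col, expected in expected_types.items()
})
print(wrong_type)
```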

### Checks
<hr>

In [7]:
df_checks = pd.DataFrame(
    [
        (3, 'A', 'a'),
        (10, 'A', 'z'),
        (9, 'B', 'b'),
        (4, 'D', 'd'),
        (7, 'C', 'c')
    ],
    columns=['number', 'category_1', 'category_2']
)

In [8]:
df_checks_issues = pd.DataFrame(
    [
        (1, 'Z', 'y'),
        (10, 'A', 'a'),
        (9, 'Y', 'b'),
        (4, 'B', 'b'),
        (-1, 'C', 'c')
    ],
    columns=['number', 'category_1', 'category_2']
)
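
This notebook does not state the rules the checks tables are tested against, so the sketch below uses two made-up rules purely for illustration: `number` must not be negative and `category_1` must be one of `A` to `D`. The names `allowed_categories`, `minimum_number` and `df_demo` are hypothetical.

```python
import pandas as pd

# Copy of df_checks_issues (df_demo is a hypothetical name)
df_demo = pd.DataFrame(
    [
        (1, 'Z', 'y'),
        (10, 'A', 'a'),
        (9, 'Y', 'b'),
        (4, 'B', 'b'),
        (-1, 'C', 'c')
    ],
    columns=['number', 'category_1', 'category_2']
)

# Assumed rules, not taken from the examples
allowed_categories = ['A', 'B', 'C', 'D']
minimum_number = 0

bad_category = ~df_demo['category_1'].isin(allowed_categories)
bad_number = df_demo['number'] < minimum_number

# Rows that break at least one of the assumed rules
breaks_rule = bad_category | bad_number
print(df_demo[breaks_rule])
```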

### For summary tables
<hr>

In [9]:
df_summary = pd.DataFrame(
    [
        ('b', 'c', 1, 6),
        ('d', 'b', 1, 9),
        ('c', 'b', 1, 0),
        ('d', 'd', 1, 9),
        ('c', 'b', 1, 1),
        ('a', 'd', 1, 3),
        ('c', 'c', 1, 0),
        ('c', 'd', 1, 0),
        ('c', 'c', 1, 0),
        ('a', 'e', 1, 4),
        ('b', 'e', 1, 7),
        ('a', 'd', 1, 4),
        ('b', 'e', 1, 6),
        ('b', 'c', 1, 8),
        ('b', 'c', 1, 7),
        ('d', 'e', 1, 9),
        ('a', 'b', 1, 5),
        ('a', 'd', 1, 5),
        ('a', 'b', 1, 4),
        ('d', 'b', 1, 10),
        ('b', 'c', 1, 6),
        ('b', 'e', 1, 7),
        ('a', 'e', 1, 4),
        ('a', 'c', 1, 3),
        ('c', 'c', 1, 0),
        ('c', 'd', 1, 2),
        ('a', 'b', 1, 3),
        ('a', 'e', 1, 5),
        ('a', 'c', 1, 3),
        ('a', 'e', 1, 4),
        ('b', 'd', 1, 6),
        ('c', 'e', 1, 1),
        ('b', 'e', 1, 7),
        ('c', 'c', 1, 0),
        ('a', 'c', 1, 5),
        ('c', 'b', 1, 0),
        ('d', 'b', 1, 8),
        ('d', 'e', 1, 10),
        ('d', 'c', 1, 8),
        ('a', 'd', 1, 3),
        ('d', 'e', 1, 10),
        ('d', 'c', 1, 8),
        ('d', 'e', 1, 10),
        ('a', 'c', 1, 4),
        ('d', 'b', 1, 8),
        ('d', 'b', 1, 10),
        ('d', 'e', 1, 10),
        ('a', 'c', 1, 5),
        ('a', 'd', 1, 5),
        ('d', 'c', 1, 10)
    ],
    columns=['str', 'str_2', 'count', 'int_max']
)
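
The column names suggest that `count` is meant to be summed and `int_max` reduced to its maximum per group. That reading is an assumption, not something stated in this notebook. A minimal sketch on the first six rows of `df_summary`:

```python
import pandas as pd

# First six rows of df_summary (df_demo is a hypothetical name)
df_demo = pd.DataFrame(
    [
        ('b', 'c', 1, 6),
        ('d', 'b', 1, 9),
        ('c', 'b', 1, 0),
        ('d', 'd', 1, 9),
        ('c', 'b', 1, 1),
        ('a', 'd', 1, 3)
    ],
    columns=['str', 'str_2', 'count', 'int_max']
)

# Sum the counts and take the largest int_max within each 'str' group
summary = df_demo.groupby('str').agg({'count': 'sum', 'int_max': 'max'})
print(summary)
```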

### For scripts
<hr>

In [10]:
df_data = pd.DataFrame(
    [
        (1, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2018, 7, 7, 0, 0), 
         'A string this is', 51.5074, 0.1278),
        (1, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2018, 4, 9, 0, 0), 
         'Test', 51.5084, 0.1268),
        (1, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2018, 1, 10, 0, 0), 
         'testing', 51.5094, 0.1258),
        (3, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2017, 10, 13, 0, 0),
         'test test test', 51.5104, 0.1248),
        (4, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2017, 7, 16, 0, 0),
         np.nan, 51.5114, 0.1238),
        (5, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2017, 4, 18, 0, 0), 
         np.nan, 51.5124, 0.1228),
        (6, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2017, 1, 19, 0, 0),
         'Blah', 51.5134, 0.1218),
        (7, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2016, 10, 22, 0, 0),
         'Dah', 51.5144, 0.1208),
        (1234, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2016, 7, 25, 0, 0), 
         'Doh', 51.5154, 0.1198),
        (3, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2016, 4, 27, 0, 0),
         'Boh', 51.5164, 0.1188),
        (2341243, datetime.datetime(2019, 1, 1, 0, 0), datetime.datetime(2016, 1, 29, 0, 0),
         'Pho', 51.5174, 0.1178)
    ],
    columns=['Number', 'A date', 'Another date£', '   StringStringString   ', 'lat', 'lng']
)

In [11]:
df_headers_1 = pd.DataFrame(
    [
        ('Header', 'Number', 'A date', 'Another date£', '   StringStringString   ', 'lat', 'lng'), 
        ('New name', 'a_number', 'date_1', 'date_2', 'string', 'lat', 'lng'),
        ('Remove', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan),
        ('Notes', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan)
    ]
)

In [12]:
df_ideal_headers = pd.DataFrame(
    [
        ('a_number', 'date_1', 'date_2', 'string', 'testing', 'a', 'b', 'lat', 'lng')
    ]
)
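
The `Header` and `New name` rows of `df_headers_1` pair each original column name with a cleaner one. As a rough sketch of how such a table could drive a rename (how the scripts example actually consumes it is not shown in this notebook, and `headers`, `rename_map` and `df_demo` are hypothetical names):

```python
import pandas as pd

# The first two rows of df_headers_1
headers = pd.DataFrame(
    [
        ('Header', 'Number', 'A date', 'Another date£', '   StringStringString   ', 'lat', 'lng'),
        ('New name', 'a_number', 'date_1', 'date_2', 'string', 'lat', 'lng')
    ]
)

# Use the first column as row labels, then pair old names with new names
headers = headers.set_index(0)
rename_map = dict(zip(headers.loc['Header'], headers.loc['New name']))

# A one-row stand-in for df_data with a subset of its columns
df_demo = pd.DataFrame(
    [(1, 'Test', 51.5084, 0.1268)],
    columns=['Number', '   StringStringString   ', 'lat', 'lng']
)

df_renamed = df_demo.rename(columns=rename_map)
print(list(df_renamed.columns))
```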

## Write out data
<hr>

In [13]:
df_convert.to_csv('data/df_convert.tsv', sep='\t', index=False)
df_convert_issues.to_csv('data/df_convert_issues.tsv', sep='\t', index=False)

df_alterations.to_csv('data/df_alterations.tsv', sep='\t', index=False)
df_alterations_issues.to_csv('data/df_alterations_issues.tsv', sep='\t', index=False)

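# Use context managers so each file handle is closed after writing
with open('data/df_checks.pkl', 'wb') as f:
    pickle.dump(df_checks, f)
with open('data/df_checks_issues.pkl', 'wb') as f:
    pickle.dump(df_checks_issues, f)

with open('data/df_summary.pkl', 'wb') as f:
    pickle.dump(df_summary, f)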

df_data.to_excel('data/A.xlsx', index=False)
# The context manager saves and closes the workbook on exit
# (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('data/headers.xlsx') as xl_writer:
    df_headers_1.to_excel(xl_writer, index=False, sheet_name='A 1', header=False)
    df_ideal_headers.to_excel(xl_writer, index=False, sheet_name='IdealHeaders', header=False)
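
One thing to keep in mind when reading the TSV files back: the conversion tables hold numbers as strings, but `pd.read_csv` infers dtypes by default, so the strings only survive the round trip if `dtype=str` is passed. The sketch below shows this with an in-memory buffer instead of a file in `data/`; `df_demo` and `buffer` are hypothetical names.

```python
import io

import pandas as pd

# Two rows in the style of df_convert, with numbers stored as strings
df_demo = pd.DataFrame(
    [('A', '1', '0.6'), ('B', '4', '5.2')],
    columns=['object', 'int', 'float']
)

# Write to an in-memory buffer the same way the notebook writes the TSV files
buffer = io.StringIO()
df_demo.to_csv(buffer, sep='\t', index=False)

# Default read: pandas infers numeric dtypes from the text
buffer.seek(0)
inferred = pd.read_csv(buffer, sep='\t')

# dtype=str keeps every column as text, matching the original table
buffer.seek(0)
as_text = pd.read_csv(buffer, sep='\t', dtype=str)

print(inferred.dtypes)
print(as_text.dtypes)
```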

---

**GigiSR**