In [None]:
#
# To run this notebook, first install Tally. Installing Tally requires an access token.
#
from google.colab import files
import json
import io
import os
# Check if the file 'tally_keys.json' exists
if not os.path.exists('tally_keys.json'):
  uploaded = files.upload()
  # Assuming only one file is uploaded, get its filename and content
  filename = list(uploaded.keys())[0]
  file_content = uploaded[filename]
  # Load JSON directly from the uploaded content
  keys = json.loads(file_content.decode('utf-8'))
else:
  # If the file already exists, just load its content
  with open('tally_keys.json', 'r') as f:
      keys = json.load(f)

try:
  # Try to import the package
  import tally_core
except ImportError:
  # If the import fails, the package is not installed. Install it.
  !pip install git+https://{keys['tally_api']}@github.com/datasmoothie/tally-core.git@master

In [1]:
import tally_core as tc
import pandas as pd
import json

dataset = tc.DataSet("Sports stores")
# Use a context manager so the metadata file is closed after reading
with open('./data/Example Data (A).json') as f:
  meta = json.load(f)
data = pd.read_parquet('./data/Example Data (A).parquet')
dataset.from_components(meta_dict=meta, data_df=data)

# 3. Clean and recode data

## Filter
To keep only the rows that meet a given condition and drop the rest, we use <a href="API/DataSet.html#tally_core.DataSet.filter">`DataSet.filter`</a>.

In [2]:
urban_and_suburban = dataset.filter(
  alias='urban and suburban', 
  condition={'locality':[2,3]}
)

Filters can use complex, nested logic that combines AND and OR conditions with other logical operators. Refer to the chapter about [Tally's logic operators](tally_logic) for more on how filters are constructed.
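
The simple condition used in the filter above, `{'locality': [2, 3]}`, can be read as "keep the rows whose `locality` code is 2 or 3". The sketch below mimics that reading in plain Python, without calling Tally, and should not be taken as Tally's implementation. The helper `matches` is hypothetical, the counts are the ones reported by the crosstab of the full dataset in this section, and the assignment of codes 4 and 5 to Rural and Remote is an assumption based on the order of the crosstab rows.

```python
# Plain-Python sketch of how a membership condition such as
# {'locality': [2, 3]} can be read. This is NOT Tally's implementation.

# Respondent counts per locality code, taken from the crosstab of the
# full dataset. Codes 1-3 follow the text; codes 4 and 5 are assumed.
counts = {1: 3106, 2: 2245, 3: 1180, 4: 718, 5: 829}

# Expand the counts into one record per respondent.
rows = [{"locality": code} for code, n in counts.items() for _ in range(n)]


def matches(row, condition):
    """Hypothetical helper: True if, for every variable named in
    `condition`, the row holds one of the allowed codes."""
    return all(row[var] in allowed for var, allowed in condition.items())


condition = {"locality": [2, 3]}
urban_and_suburban_rows = [row for row in rows if matches(row, condition)]

print(len(rows))                     # 8078
print(len(urban_and_suburban_rows))  # 3425
```

The two printed totals correspond to the `Base` rows of the unfiltered and filtered crosstabs.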

As a sanity check, we can compare the crosstab of the variable we used to create the filter, before and after filtering.

In [3]:
dataset.crosstab('locality')
urban_and_suburban.crosstab('locality')

| Question | Values | Total |
|---|---|---|
| locality. How would you describe the areas in which you live? | Base | 8078.0 |
| locality. How would you describe the areas in which you live? | CBD (central business district) | 3106.0 |
| locality. How would you describe the areas in which you live? | Urban | 2245.0 |
| locality. How would you describe the areas in which you live? | Suburban | 1180.0 |
| locality. How would you describe the areas in which you live? | Rural | 718.0 |
| locality. How would you describe the areas in which you live? | Remote | 829.0 |


| Question | Values | Total |
|---|---|---|
| locality. How would you describe the areas in which you live? | Base | 3425.0 |
| locality. How would you describe the areas in which you live? | CBD (central business district) | 0.0 |
| locality. How would you describe the areas in which you live? | Urban | 2245.0 |
| locality. How would you describe the areas in which you live? | Suburban | 1180.0 |
| locality. How would you describe the areas in which you live? | Rural | 0.0 |
| locality. How would you describe the areas in which you live? | Remote | 0.0 |


:::{warning}
By default, the `filter` method returns a new dataset containing the filtered rows. For large datasets, this can consume a lot of memory and can make the difference between a data processing script running smoothly and running out of memory. Setting the parameter `inplace` to `True` modifies the dataset currently in memory instead. Avoid overusing `filter`: functions that need to operate on a subset of the data can be passed a filter directly.
:::
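
The trade-off described in the warning is the general difference between building a new object and modifying an existing one. The sketch below illustrates that idea with a plain Python list. It says nothing about how Tally implements `inplace`; the function names `filtered_copy` and `filter_in_place` are hypothetical, and the codes are illustrative values, not survey data.

```python
def filtered_copy(rows, allowed):
    """Return a new list; the original list stays in memory as well."""
    return [row for row in rows if row in allowed]


def filter_in_place(rows, allowed):
    """Replace the contents of the existing list; only the filtered
    rows remain afterwards."""
    rows[:] = [row for row in rows if row in allowed]


# Illustrative codes only, not survey data.
codes = [1, 2, 3, 4, 5, 2, 3]

subset = filtered_copy(codes, {2, 3})
print(subset is codes)  # False: two separate lists now exist
print(len(codes))       # 7: the original is untouched

same_object = codes
filter_in_place(codes, {2, 3})
print(same_object is codes)  # True: still the same list object
print(codes)                 # [2, 3, 2, 3]
```

With the copying variant, both the full and the filtered data are held at once, which is the memory cost the warning refers to.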





## Recode
Recoding can help with tasks such as cleaning, modifying or correcting data. 

For example, if we want to recode `locality` so that people who live in Central Business Districts are counted as living in Urban areas, we construct our recode logic so that rows matching the logical expression `{'locality':[1]}` are assigned the code 2.

In [4]:
recode_logic = {
  2: {'locality': 1}
}
dataset.recode('locality', recode_logic)

Then we use `crosstab` to check our results.

In [5]:
dataset.crosstab('locality')

| Question | Values | Total |
|---|---|---|
| locality. How would you describe the areas in which you live? | Base | 8078.0 |
| locality. How would you describe the areas in which you live? | CBD (central business district) | 0.0 |
| locality. How would you describe the areas in which you live? | Urban | 5351.0 |
| locality. How would you describe the areas in which you live? | Suburban | 1180.0 |
| locality. How would you describe the areas in which you live? | Rural | 718.0 |
| locality. How would you describe the areas in which you live? | Remote | 829.0 |
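
As a cross-check on the table above, the recode can be replayed on the original counts in plain Python. This sketch only reproduces the arithmetic and does not call Tally. The `remap` dictionary is an illustrative stand-in for the recode logic, and the assignment of codes 4 and 5 to Rural and Remote is assumed from the order of the crosstab rows.

```python
from collections import Counter

# Counts per locality code before the recode, from the crosstab of the
# full dataset. Codes 4 and 5 are assumed to be Rural and Remote.
before = Counter({1: 3106, 2: 2245, 3: 1180, 4: 718, 5: 829})

# Mapping implied by the recode: code 1 (CBD) becomes code 2 (Urban).
remap = {1: 2}

after = Counter()
for code, n in before.items():
    # Codes without an entry in `remap` keep their original value.
    after[remap.get(code, code)] += n

print(after[2])             # 5351
print(after[1])             # 0
print(sum(after.values()))  # 8078
```

The Urban count is the sum of the original CBD and Urban counts (3106 + 2245 = 5351), and the base is unchanged because a recode moves respondents between codes without removing rows.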
