# Introduction to Coding for AI

## 5. Data processing

So far we have been working with very small data snippets, but of course, datasets are normally stored in files or databases. In our challenge, we’ll focus on tabular data, which is a very common and flexible format that allows you to process text and numerical values. Think of the data you normally find in spreadsheets.

Let’s take a look at some of the common data formats for handling tabular data in real-world situations: TXT, CSV, and JSON.

### 5.1. Directory tree

All the data in your computer has a directory tree structure. For example, you can see the content of our challenge directory in your file explorer:

<img src="../data/content/directory_tree.png" width="90%"/>

You can think of these files as forming the following directory tree:

```
Introduction to Coding for AI/
│
├─ data/
│  │
│  ├─ content/
│  │  ├─ image_1.png
│  │  ├─ image_2.png
│  │  ├─ image_3.png
│  │  ├─ ...
│  │
│  ├─ datasets/
│     ├─ dataset_1.csv
│     ├─ dataset_2.json
│     ├─ dataset_3.xlsx
│     ├─ ...
│   
├─ notebooks/
   ├─ notebook_1.ipynb
   ├─ notebook_2.ipynb
   ├─ notebook_3.ipynb
   ├─ ...
```

The first step to load data from your computer is to tell Python where to search for it. Here we will import a function called `glob` from the standard-library module that is also called `glob`. The function takes a path pattern for the **folder** (also called **directory**) that you specify and returns a list with the paths of all the files and folders that match it.

In [1]:
from glob import glob

all_file_paths_here = glob("./*")
python_file_paths_here = glob("./*.py")
all_file_paths_above = glob("../*")

print(f"all_file_paths_here:\n{all_file_paths_here}\n")
print(f"python_file_paths_here:\n{python_file_paths_here}\n")
print(f"all_file_paths_above:\n{all_file_paths_above}\n")

all_file_paths_here:
['./3. Data structures and handling errors.ipynb', './library.py', './5. Data processing.ipynb', './__pycache__', './2. Flow and Functions.ipynb', './4. Custom classes and modules.ipynb', './1. Getting Started.ipynb', './6. Training and evaluating models.ipynb']

python_file_paths_here:
['./library.py']

all_file_paths_above:
['../dist', '../README.md', '../local', '../data', '../notebooks']



The string we passed to `glob` starts with one dot and a forward slash (`./`). The dot (`.`) means **here**, the current directory where this Jupyter Notebook is located in your directory tree, and the forward slash (`/`) indicates that there is a directory there. Similarly, you use two dots (`../`) to tell Python that it should go one directory level **above**, and from there start following the rest of the path. You can repeat these two dots as many times as you want to go up in your directory tree, so `../../../` would go up three levels for example.

Next we see a star (`*`) and a star followed by the file extension of Python files (`.py`). The star alone (`*`) means **everything**, so it tells `glob` to add all files and folders to the list that it will return (except hidden ones, whose names start with a dot). However, when the star is written next to other characters, its meaning narrows: `*.py` tells `glob` to add only the paths ending with `.py` to the list that it will return.

To recap, to open files located in the same directory as the notebook, use one dot only: `./file.text` searches for `file.text` in the current directory, and `./folder/file.text` searches for `file.text` in a folder called `folder` located in the same place as the Jupyter notebook.
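As a small, self-contained sketch of these patterns, the cell below builds a throwaway folder with the standard-library module `tempfile` and then searches it with `glob`. The folder and the file names are hypothetical and exist only so that the example can run anywhere.

```python
import os
import tempfile
from glob import glob

# Build a temporary folder with three hypothetical files and one subfolder.
with tempfile.TemporaryDirectory() as folder:
    for name in ["notes.txt", "report.txt", "script.py"]:
        with open(os.path.join(folder, name), "w") as file_handle:
            file_handle.write("example\n")
    os.mkdir(os.path.join(folder, "subfolder"))

    # "*" matches everything; "*.txt" matches only names ending in ".txt".
    everything = glob(os.path.join(folder, "*"))
    text_files = glob(os.path.join(folder, "*.txt"))

    # glob does not guarantee any order, so we sort the names ourselves.
    everything_names = sorted(os.path.basename(path) for path in everything)
    text_file_names = sorted(os.path.basename(path) for path in text_files)

print(everything_names)
print(text_file_names)
```

Sorting the result is a design choice worth keeping in mind: the order returned by `glob` depends on the operating system.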

#### Exercise:
1. Copy the code of the cell above into the cell below.
2. Add a command that gets the list of all jupyter notebook files (`.ipynb`) in the current directory.
3. Store this list in a variable with a suitable name and print them.
4. Run the cell to show the results.

### 5.2. Text Data

In the following sections, we'll see how to **read** data from files, **transform** it to be ready for analysis, and **write** it back to your hard drive. Let's begin with an example of text data.

#### 5.2.1. Read

In the project folder that contains the notebooks, there is also a directory with datasets. We’ll start with a dataset called `spam.txt` that contains a collection of SMS messages. This dataset is a simplified version of the original SMS Spam Collection Dataset found on [Kaggle](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset).
In the dataset, each line contains the SMS text as well as its category (class): spam or ham. In case you don't know it, ham means that the SMS is OK and not spam.
Emails already have a spam filter, but wouldn't it be great if phone companies could do the same? That's what we'll try to help them with.
Let's start by reading the data. Below is a way to read text files line by line, check it out:

In [2]:
with open("../data/datasets/spam.txt", "r") as file_handle:
    first_line = file_handle.readline()
    print(first_line)  # String formatted by print()
    print(repr(first_line))  # Raw, printable representation of the string

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat.... The class of this SMS is: ham

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat.... The class of this SMS is: ham\n'


Let’s unpack this code step by step:

**with ... as ...**

The statement `with <object> as <handle>:` means that the expression you place after `with` has to return an object, and that this object is assigned to the variable indicated in `<handle>`. In our case, the built-in function `open()` returns an object associated with the file you are opening. When Python opens a file, it should close it once it has finished working with it; otherwise, the file could become corrupted or its latest changes might not be saved. One way of doing this is by calling the method `<handle>.close()`, but to simplify our code we use the `with ... :` statement, which creates a code block starting on the line after the colon. Once your program finishes working inside this code block and leaves it, `with` automatically closes the file for you, even if an error occurred inside the block.
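The following sketch illustrates this behaviour with a hypothetical temporary file (created through `tempfile` only for the demonstration). It checks the attribute `.closed` inside and outside the block, and then shows roughly what you would have to write without `with`.

```python
import os
import tempfile

# Hypothetical temporary file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as file_handle:
    file_handle.write("hello\n")
    closed_inside = file_handle.closed  # False: still open inside the block

closed_after = file_handle.closed  # True: closed as soon as the block ends

# Roughly equivalent code without `with`:
file_handle = open(path, "r")
try:
    content = file_handle.read()
finally:
    file_handle.close()  # runs even if reading raised an error

print(closed_inside, closed_after, repr(content))
```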

**open( )**

The statement `open("<path/to/filename.extension>", "mode")` opens a file with the **mode** that you indicate. The modes that we are interested in are `r`, `w` and `a`.
  - `r`: Only reads the file. This is a safe option to avoid messing up  the data.
  - `w`: Creates a file to write. Be careful, if the file exists already it will delete its content first. Better avoid this, unless you are sure of it!
  - `a`: Opens a file to write, but instead of overwriting its content, everything you write is appended at the end of the file.
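To make the difference between the modes concrete, here is a minimal sketch that writes to a hypothetical temporary file (not one of the challenge datasets) with `w`, then `a`, then `w` again, reading the content back with `r` after each step.

```python
import os
import tempfile

# Hypothetical temporary file, so no real data is overwritten.
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

with open(path, "w") as file_handle:  # creates the file
    file_handle.write("first\n")

with open(path, "a") as file_handle:  # appends at the end
    file_handle.write("second\n")

with open(path, "r") as file_handle:  # read-only
    after_append = file_handle.read()

with open(path, "w") as file_handle:  # deletes the old content first!
    file_handle.write("third\n")

with open(path, "r") as file_handle:
    after_overwrite = file_handle.read()

print(repr(after_append))
print(repr(after_overwrite))
```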

**.readline( )**

Next we see that `first_line = file_handle.readline()` reads one line from the `file_handle` and stores it in the variable `first_line`. In the example, we only execute `.readline()` one time, so we only read the first line in the file. What if you want to read multiple lines? In that case, you can iterate over the file inside a `for` loop (we’ll see the syntax in the following example). There is also a method called `.readlines()`, the plural, that reads all the lines in the file at once, but it’s not advised when you work with very large files because it loads the whole file into memory! It’s good for you to know this method exists, but to stay away from trouble, let’s only use the first method and always read one line at a time. Slow and steady wins the race 😉
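Here is a minimal sketch of reading one line at a time, using a hypothetical three-line temporary file. It also shows a detail not covered above: when there is nothing left to read, `.readline()` returns an empty string.

```python
import os
import tempfile

# Hypothetical three-line file created only for this example.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as file_handle:
    file_handle.write("line 1\nline 2\nline 3\n")

with open(path, "r") as file_handle:
    first = file_handle.readline()   # reads "line 1\n"
    second = file_handle.readline()  # continues where the last call stopped

    remaining = []
    for line in file_handle:         # the loop picks up from the third line
        remaining.append(line)

    at_end = file_handle.readline()  # nothing left to read

print(repr(first), repr(second), remaining, repr(at_end))
```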

**repr( )**

Even if the `print()` function doesn't show it, there is a `\n` at the end of each line in text files (only the very last line of a file may lack it).
In the example, we use the function `repr()` to show it.
We'll briefly explain what `repr()` does, but the main point for you to remember is that at the end of each line of text in a file there is a *hidden* newline character, `\n`.

Notice how the first print command (`print(first_line)`) interprets the string in `first_line`, so instead of showing `\n` at the end of the line, it **adds** a new line. Conversely, the second print command (`print(repr(first_line))`) uses `repr()` to get the *printable representation* of the string. This means that instead of interpreting the escape characters inside the string, `print()` shows all the characters inside it, that is, its *raw* content.
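A short sketch with a made-up string shows the same effect without opening any file. Note that `\n` counts as a single character, even though we type it with two.

```python
text = "ham\n"  # made-up example string

print(text)        # print() interprets "\n" and moves to a new line
print(repr(text))  # repr() makes the hidden character visible

representation = repr(text)
number_of_characters = len(text)  # "h", "a", "m" and "\n"
```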

#### 5.2.2. Transform

Now let's start doing some processing of the text in this file. As an exercise, let's create a list with the SMS texts and a list with the corresponding *classes* (*ham* or *spam*). **Attention:** in the context of *Python programming language*, a *class* is a code structure. In the context of *Machine Learning*, a *class* is the category of something. In the latter case, the SMS text may belong to the class *ham* or to the class *spam*.

**String parsing**

The first step is to know how to process, or **parse**, each line of text. In the previous cell we saw the first line in the file, so assuming all lines have the same structure, we can create a program that:

1. Splits the SMS text from the SMS class using the string `". The class of this SMS is: "` as a separator.
2. The SMS text is ready, so now we just need to remove the `\n` trailing the class name `ham`.
3. Then store the SMS in one list and its class in another.

In [3]:
# Important:
# Declare the lists before you use them,
# otherwise you'll get an error:
sms_texts = []
sms_classes = []

with open("../data/datasets/spam.txt", "r") as file_handle:
    
    counter = 0
    for line in file_handle:
        
        sms_text, sms_class = line.split(". The class of this SMS is: ")
        
        sms_class = sms_class[:-1]  # Remove "\n"
        
        sms_texts.append(sms_text)
        sms_classes.append(sms_class)
        
        counter += 1
        
print(f"Total number of instances: {counter}")
print(f"First SMS:\n{sms_texts[0]}")
print(f"First class:\n{repr(sms_classes[0])}")

Total number of instances: 5573
First SMS:
Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
First class:
'ham'


All the elements in the program above are already known to us. Let's go briefly over them to make sure everything is clear.

- First we initialize our data structures, which are the two lists where we'll store our clean data after processing it. We must always initialize data structures before modifying them, otherwise, Python will produce an error.
- Then we open the file and start reading its text, line by line. The variable `file_handle` is an **iterator**, so when we place it in a `for` loop it returns item after item until no more elements are left.
- Next, we split each text line with a string **pattern**. In our case, we know that the same text is always repeated between each SMS and its class: `". The class of this SMS is: "`, including a space at the end of the string. 
- Afterwards, we remove the last character `\n` from the string `ham\n` and store the *cleaned* string `ham` back in the variable `sms_class`. For this we *slice* the string; remember when we saw *slicing*? We use the command `sms_class[:-1]`, which is the same as `sms_class[0:-1]`, so the start index is `0` and the end index is `-1`. In other words, we keep all the characters in the string, from index `0` and up to, *but not including*, the last index to remove the last character `\n`.
- Finally, we append the string `sms_text` in the list `sms_texts`, and the string `sms_class` in the list `sms_classes`.
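- Notice that before the `for` loop, we initialize a `counter` and then increase its value by `1` in each iteration of the loop. In this way, we can count how many lines are in our text file.

As an alternative to a manual counter, the built-in function `enumerate()` returns the index together with each line: `for index, line in enumerate(file_handle):`.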
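As a sketch of how `enumerate()` could replace the manual counter, the cell below parses two made-up lines that follow the format of `spam.txt`. It reads them from an in-memory `io.StringIO` object instead of the real file, which is an assumption made only to keep the example self-contained. It also uses `.rstrip("\n")` instead of slicing, a slightly more defensive variant that removes the `\n` only when it is actually there (the last line of a file may not have one).

```python
import io

# Two made-up lines in the format of spam.txt; the last one has no "\n".
fake_file = io.StringIO(
    "See you at 5. The class of this SMS is: ham\n"
    "WIN a prize now. The class of this SMS is: spam"
)
separator = ". The class of this SMS is: "

parsed = []
# start=1 makes the index count lines from 1 instead of 0.
for index, line in enumerate(fake_file, start=1):
    sms_text, sms_class = line.split(separator)
    sms_class = sms_class.rstrip("\n")  # removes "\n" only if present
    parsed.append((index, sms_text, sms_class))

print(parsed)
```

With `sms_class[:-1]`, the class of the second line would have been cut to `spa`, because that line has no trailing `\n` to remove.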

#### Exercise:
1. Copy the code of the cell above into the cell below.
2. Replace the two lists `sms_texts` and `sms_classes` with four new lists:
  - A list for storing the text of ham messages.
  - A list for storing the text of spam messages.
  - A list for storing the class of ham messages.
  - A list for storing the class of spam messages.
3. Count how many *ham* and how many *spam* elements are in the dataset and print the answers.
4. Run the cell to show the results.

- **Going further**: Instead of having four separate lists, create a dictionary with four keys. Each key should be a string with the name of the variable, its corresponding value should be the list with data. For example, `sms_texts = []` and `sms_classes = []` would become `data_dictionary = {"sms_texts": [], "sms_classes": []}`. Therefore, `sms_texts.append(sms_text)` would become `data_dictionary["sms_texts"].append(sms_text)`.

#### 5.2.3. Write

The last step of our text data processing is to save the data.
Below you can see that we repeat the same processing we did before, and afterward we create two new files to save the SMS texts and the SMS classes separately.

Also, notice that when we write each line, we add a newline character `\n` at the very end, so that the next line we write starts on a new line below. There are two ways of adding `\n` at the end of the SMS text: inserting it with string formatting, or appending it with concatenation.
To insert it, we can use the formatting method that we have used before: `f"{sms_text}\n"`, and to append it we can use the `+` sign, as it concatenates the two strings on its left and right sides: `sms_text + "\n"`.
Both methods produce the same result, so we use the `+` method, as it is the simplest in this case.
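A quick check of this claim, using an in-memory `io.StringIO` buffer as a stand-in for a real file (an assumption made only for the example). It also shows why we need the `\n` in the first place: `.write()` never adds a new line by itself.

```python
import io

sms_text = "Ok lar... Joking wif u oni..."

line_with_plus = sms_text + "\n"
line_with_fstring = f"{sms_text}\n"

# .write() writes exactly the characters it receives, nothing more.
buffer = io.StringIO()
buffer.write("ham")
buffer.write("spam")
written = buffer.getvalue()

print(line_with_plus == line_with_fstring)
print(repr(written))
```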

In [4]:
# Preprocess data:
sms_texts = []
sms_classes = []
with open("../data/datasets/spam.txt", "r") as file_handle:
    for line in file_handle:
        sms_text, sms_class = line.split(". The class of this SMS is: ")
        sms_class = sms_class[:-1]
        sms_texts.append(sms_text)
        sms_classes.append(sms_class)

# Save data:
filename = "../data/datasets/sms_texts.txt"
data_list = sms_texts
with open(filename, "w") as file_handle:
    for item in data_list:
        line = item + "\n"
        file_handle.write(line)

# Save data:
filename = "../data/datasets/spam_classes.txt"
data_list = sms_classes
with open(filename, "w") as file_handle:
    for item in data_list:
        line = item + "\n"
        file_handle.write(line)

You can also see that instead of typing the file path inside the `open()` function, we define it as a variable above it. Also, instead of using a different variable name for the list inside the `for` loop, we pass it the same variable name `data_list`.
This optimization of code to make it more reusable is called code **refactoring**, and replacing values inside pieces of code with variables is called **extracting** parameters.
More concretely, we can reuse the code for writing a file if we specify the `filename` and the `data_list`.

#### Exercise:
1. Copy the code of the cell above into the cell below.
2. Below the code, create a *function* with the arguments `filename` and `data_list`.
3. Below the function definition, call the function two times. The first time, provide it with the parameters to save the SMS texts, and the second time, provide it with the parameters to save the SMS classes. For example, if you name the function save_data(), you should call it as follows: `save_data(filename, data_list)`.
4. Run the cell to execute the code.

- **Going further**: If you feel more adventurous, create a class called `DataCenter`. The class should have two methods:
  1. `preprocess_sms()` reusing the code we wrote for processing our SMS data. It should take as parameters the `filename` to open, and the `cut_pattern` used to split the *text* from the *class*. Finally, it should return two lists: `sms_texts` and `sms_classes`.
  2. `save_data()` reusing the code we wrote for saving text data.

Then create an instance of the class, call one time the `preprocess_sms()` method, and call two times the `save_data()` method to save the SMS texts and the SMS classes.

### 5.3. CSV Data

Reading and writing took lots of steps in the code we used above. All these steps are repeated often, so it would be a good idea to standardize the read-and-write procedures in a single reusable module. But what about the *cut pattern* we use to separate the different features in our data? Fret not, we can standardize this too, and that's where Comma-Separated Values (**CSV**) files come in handy.

Isn't this going to be complicated? No, it doesn't have to be, because we'll use **Pandas** 🐼, an external library that specializes in tabular data. Normally, you have to install external libraries manually, but in our case, Anaconda comes with all the Data Science libraries we need, including Pandas.

So, let's get into coding and import this useful Pandas library.

#### 5.3.1 Read

Let's go straight to an example.

In [5]:
import pandas as pd

data = pd.read_csv("../data/datasets/spam.csv")

print(f"Type of data: {type(data)}")
data.head()

Type of data: <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,sms,class
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham


Done! Great, isn't it? 😊

When we write `import pandas as pd`, the `pd` is just a way to assign a shorter name to the module. This helps to make the code look cleaner and saves time when writing lots of code. For example, when calling the method to read CSV files, now we can write `pd.read_csv()` instead of the longer `pandas.read_csv()`.

Finally, when you call the method `.head()` of the object `data`, it displays the first five rows of your data in a pretty-looking table that makes it easier to inspect the data. By the way, the data type of the object returned by `pd.read_csv()` is called `DataFrame`. You can inspect it with the function `type()`, as we did in the cell above.
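If you want to experiment without the dataset file, `pd.read_csv()` also accepts a file-like object. The sketch below feeds it a tiny made-up CSV through `io.StringIO`; the two rows are invented for illustration and are not taken from `spam.csv`.

```python
import io

import pandas as pd

# A tiny made-up CSV: one header line followed by two rows.
csv_text = (
    '"sms","class"\n'
    '"Hello, how are you?","ham"\n'
    '"WIN a prize now","spam"\n'
)

data = pd.read_csv(io.StringIO(csv_text))

print(type(data))
print(data.shape)    # (number of rows, number of columns)
print(data.head(1))  # .head(n) shows only the first n rows
```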

#### 5.3.2 Transform

Below you can see how `spam.txt` differs from `spam.csv`, the CSV version of our dataset. We'll print two lines of the CSV file, as the first line has the name of each feature (`sms` and `class`). In the second line you can see that the cut pattern `. The class of this SMS is: ` is gone. This pattern is no longer needed, as it has been replaced by other markers: the features (texts and classes) are placed between double quotes (`" "`), and each feature is separated from the next with a comma (hence the name CSV). This means that we no longer need to parse each line ourselves; the CSV conventions already take care of this.

In [6]:
with open("../data/datasets/spam.txt", "r") as file_handle:
    line = repr(file_handle.readline())
    print(line, "\n")

with open("../data/datasets/spam.csv", "r") as file_handle:
    line = repr(file_handle.readline())
    print(line)
    line = repr(file_handle.readline())
    print(line)

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat.... The class of this SMS is: ham\n' 

'"sms","class"\n'
'"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...","ham"\n'
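To see why the double quotes matter, consider that the SMS text itself contains a comma. The sketch below uses the standard-library module `csv` (Pandas follows the same conventions) to write and read one row in memory, and compares the result with a naive `.split(",")`. The option `quoting=csv.QUOTE_ALL` is chosen here to imitate the quoting seen in the output above; how `spam.csv` was actually produced is not known.

```python
import csv
import io

row = ["Go until jurong point, crazy..", "ham"]

# Write the header and one row, wrapping every field in double quotes.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["sms", "class"])
writer.writerow(row)
csv_text = buffer.getvalue()

# A CSV reader understands the quotes and recovers exactly two fields...
rows = list(csv.reader(io.StringIO(csv_text)))

# ...while a naive split also cuts at the comma inside the SMS text.
naive_fields = csv_text.splitlines()[1].split(",")

print(rows[1])
print(naive_fields)
```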


#### 5.3.3 Write

Finally, we can also save our features (texts and classes) as individual files. For this, we select the respective column in our dataframe and use Pandas to save it.

#### Note:
We are importing pandas again and reading the data again, even though we already did this in the previous cell and the data is already in memory. We did the same when we defined our custom functions and we’ll keep repeating these loads and definitions. Why? Simply to make the cells in the notebook stand-alone, so you can run them independently and without having to run all the cells above first. 

In [7]:
import pandas as pd

data = pd.read_csv("../data/datasets/spam.csv")

data.to_csv("../data/datasets/sms_texts.csv", columns=["sms"], index=False)
data.to_csv("../data/datasets/sms_classes.csv", columns=["class"], index=False)

Two lines of code, amazing! The `.to_csv()` method can take additional parameters, like `columns` to indicate a list with the names of the columns that you want to save, or `index` to indicate whether you want to add the index numbers to your CSV file or not. How can we know all the parameters that can be passed to this method? The best way to understand a library is through its documentation. It may look a bit complex at first, but after a while you will become comfortable reading it.

For example, [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html) you can find the documentation for `.to_csv()`. At the top, you can see the definition of the method and the default values of its arguments. Below it you will find a detailed description of each parameter, and at the bottom some examples.
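To see what `columns` and `index` do without touching any file, you can call `.to_csv()` without a path, in which case it returns the CSV content as a string. The small data frame below is made up for the example.

```python
import pandas as pd

# Made-up data frame with the same column names as our dataset.
data = pd.DataFrame({
    "sms": ["Hello", "WIN a prize now"],
    "class": ["ham", "spam"],
})

# Without a file path, .to_csv() returns the CSV text instead of saving it.
with_index = data.to_csv()
only_sms = data.to_csv(columns=["sms"], index=False)

print(with_index)
print(only_sms)
```

By default the index numbers are written as an extra first column, which is why the cell above passes `index=False`.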

Part of being a software developer is reading the libraries' documentation and searching for answers in technical forums. Ask any software developer, and they will tell you they do this on a daily basis! It takes a bit of time to get used to these kinds of documents and navigate them confidently, but once you come to grips with their common structure, it becomes much easier and more enjoyable. Seriously, it's a promise 😉

So let's get some practice and do the exercise below.

#### Exercise:
1. Google "pandas read excel" and "pandas write dataframe to excel", and find the pages in the official Pandas documentation that describe these two methods.
    - **Tip:** The main page of the Pandas' official documentation is https://pandas.pydata.org/pandas-docs/stable/reference/
2. At the top of the documentation pages, you can find the **signature** of the methods. The signature shows you the parameters that you can pass to the methods. Read the parameters that each function has, and notice the default value of each.
3. At the bottom of the documentation pages you can find examples of how to use the methods.
4. In the cell below, load the file `"../data/datasets/spam.xlsx"` and
    - save the column `"sms"` in a file called `"../data/datasets/sms_texts.xlsx"`
    - save the column `"class"` in a file called `"../data/datasets/sms_classes.xlsx"`

### 5.4. JSON Data

Finally, we'll see another popular format used to transfer data between processes, databases, and to communicate between different programming languages: JSON (JavaScript Object Notation).
More concretely, JSON is a standard file and data-interchange format that uses human-readable text to store and transmit data objects consisting of **attribute–value pairs** (like **key-value pairs** in Python dictionaries) and **arrays** (like **lists** in Python).
The good news is that the JSON format is very similar to Python dictionaries, so you are already familiar with its structure.
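When the top level of a JSON document is a single object, Python converts it into a dictionary, and when it is an array of objects, Python converts it into a list of dictionaries. These are examples of JSON data (the labels before the `=` signs are only names for the examples and are not part of the JSON itself):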

```
single_user = {
    "name": "John",
    "last_name": "Smith",
    "age": 24,
    "hobby": {
        "reading" : true,
        "gaming" : false,
        "sport" : "football"
    },
    "children" : ["Peter", "Laura"]
}

multiple_users = [
    {"name": "John", ...},
    {"name": "Jessica", ...},
    {"name": "Peter", ...}
]
```

Let's go over some subtle differences between JSON and Python:
- What in Python is a dictionary (`dict`), in JSON is called an `Object`.
- A `list` in Python is called an `Array` in JSON.
- The value `None` in Python is called `null` in JSON.
- In Python, `True` and `False` have the first letter capitalized, but in JSON they are all lowercase (`true` and `false`).

| Python |  JSON  |
|:------:|:------:|
|  dict  | Object |
|  list  |  Array |
|  True  |  true  |
|  False |  false |
|  None  |  null  |
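The standard-library module `json` performs these conversions for you. As a minimal sketch, `json.loads()` decodes a JSON text string into Python objects and `json.dumps()` does the opposite. The keys in the example are made up.

```python
import json

# JSON text: note the lowercase true/false and the null.
json_text = '{"name": "John", "gaming": false, "car": null, "children": ["Peter", "Laura"]}'

# JSON -> Python: Object becomes dict, Array becomes list, null becomes None.
user = json.loads(json_text)

# Python -> JSON: True becomes true, None becomes null.
back_to_json = json.dumps({"reading": True, "pet": None})

print(user)
print(back_to_json)
```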

#### 5.4.1 Read


To load JSON data into a Python dictionary (or into a list, when the JSON contains an array of items), you need to import the module `json`.
You can use this module to read a file or to decode information in a text string.
To start, let's read the same dataset we used in the CSV example (the SMS ham/spam one 😉), but in JSON format.
For this, you can use the function `json.load()`.

In [8]:
import json

# Read a FILE:
with open("../data/datasets/spam.json", "r") as file_handle:
    data = json.load(file_handle)
    print(f"The type of data is: \n {type(data)}")
    print(f"The number of elements in the data is: \n {len(data)}")
    print(f"The first element in the data is: \n {data[0]}")

The type of data is: 
 <class 'list'>
The number of elements in the data is: 
 5573
The first element in the data is: 
 {'sms': 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'class': 'ham'}


Wow, the result for the first element in the data looks crammed and is not an easy read! But as you should know by now, in programming languages there’s always a solution 😉. If you would like to improve the way dictionaries are printed, you can import the class `PrettyPrinter` from the standard-library module `pprint`. Once you import this class:
- Create an instance of it in the same way you created instances of classes before. For example, `pp = PrettyPrinter()` creates an instance called `pp`.
- Then, use the instance object to call the method `pprint()`. For example, you can execute `pp.pprint(dictionary)` to print more *prettily* the contents of the dictionary.

Take a look at the following example:

In [9]:
import json

from pprint import PrettyPrinter
pp = PrettyPrinter()

# Read a FILE:
with open("../data/datasets/spam.json", "r") as file_handle:
    data = json.load(file_handle)
    print("The first element in the data is:")
    pp.pprint(data[0])

The first element in the data is:
{'class': 'ham',
 'sms': 'Go until jurong point, crazy.. Available only in bugis n great world '
        'la e buffet... Cine there got amore wat...'}


Much better, isn’t it? You can now instantly see the `class` in the first line, and separately the `sms` text on the second. 

#### 5.4.2 Transform

Great, the data is in a dictionary and you can read it, so why don’t we look at how to process it. For example, let’s add a feature called `binary_class` with the value `0` when the text is `spam`, and with the value `1` when the text is `ham`. For this, we can iterate over our list of dictionaries with a *for* loop, or use pandas. 

Frequently, data has missing values, so it could be the case that some texts have no `class` value.
It is important to catch these cases and decide what to do with them; otherwise, your program could crash when trying to add a number to a `None`, for example.
There are multiple alternatives to solve this issue. You could remove the rows that have any missing value (although you may end up with no data!), or you could replace them with educated guesses, for example with the average value in the column, or the average between the two adjacent rows. This process of filling in missing values is called **imputation**.
Observe in both of the methods below the alternative ways to verify that your data is complete.

In [10]:
# Iterating over a list of dictionaries with a for loop

import json

from pprint import PrettyPrinter
pp = PrettyPrinter()


# LOAD the data:
with open("../data/datasets/spam.json", "r") as file_handle:
    data = json.load(file_handle)

# INSPECT the data:
print(f'The type of "data" is:\n  {type(data)}')
print(f'The type of each item in "data" is:\n  {type(data[0])}')
print("\nThe first element in the ORIGINAL data is:")
pp.pprint(data[0])

# TRANSFORM the data:
for index, item in enumerate(data):
    if item["class"] == "spam":
        data[index]["binary_class"] = 0
    elif item["class"] == "ham":
        data[index]["binary_class"] = 1
    else:
        # Warn if there are missing values:
        print(f'The class must be "spam" or "ham". The class of item {index} is: {item["class"]}')

# VERIFY the data:
print("\nThe first element in the TRANSFORMED data is:")
pp.pprint(data[0])

The type of "data" is:
  <class 'list'>
The type of each item in "data" is:
  <class 'dict'>

The first element in the ORIGINAL data is:
{'class': 'ham',
 'sms': 'Go until jurong point, crazy.. Available only in bugis n great world '
        'la e buffet... Cine there got amore wat...'}

The first element in the TRANSFORMED data is:
{'binary_class': 1,
 'class': 'ham',
 'sms': 'Go until jurong point, crazy.. Available only in bugis n great world '
        'la e buffet... Cine there got amore wat...'}


When we create our new `binary_class` feature, we want the change to end up in the original `data` object.
To make this explicit, we use the function `enumerate()` that we introduced in the notebook *2. Flow and Functions*.
This function helps us because it returns the `index` and the `item` of the list in each iteration, so we evaluate our logical conditions on the `item`, and then we modify the corresponding item in `data` by pointing to it with its `index`. Because `item` refers to the same dictionary that is stored in the list, writing `item["binary_class"] = ...` would also work; using the index simply makes it obvious which object we are changing.
Notice that you can use consecutive pairs of brackets `[]` to go deeper in the data structure. For example:
- `data` returns a list.
- `data[index]` returns a dictionary.
- `data[index]["binary_class"]` returns a number.
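The relationship between `item` and `data[index]` is easy to get wrong, so here is a small sketch with two made-up records. Inside the loop, both names point to the same dictionary, so adding a key through `item` changes the list. Re-assigning the name `item`, on the other hand, never reaches the list.

```python
data = [{"class": "spam"}, {"class": "ham"}]  # made-up records

for index, item in enumerate(data):
    # `item` and `data[index]` are two names for the same dictionary.
    same_object = item is data[index]

    # Adding a key through `item` therefore changes the original list.
    item["binary_class"] = 0 if item["class"] == "spam" else 1

    # Re-assigning the name only changes the loop variable, not the list.
    item = "this does not reach the list"

print(same_object)
print(data)
```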

Now let's see how to perform the same procedure in a simplified way by using Pandas.

In [11]:
# Applying a function to a Pandas data frame

import pandas as pd


# LOAD the data:
data = pd.read_json("../data/datasets/spam.json")

# INSPECT the data:
print(data.head(), "\n")

# TRANSFORM the data:
data["binary_class"] = data.apply(
    lambda row: 0 if row["class"] == "spam" else 1,
    axis=1
)

# VERIFY the data:
print(data.head(), "\n")

# Warn if there are missing values:
nan_values = data["binary_class"].isnull().sum()
print(f"Number of missing values: {nan_values}")

                                                 sms class
0  Go until jurong point, crazy.. Available only ...   ham
1                      Ok lar... Joking wif u oni...   ham
2  Free entry in 2 a wkly comp to win FA Cup fina...  spam
3  U dun say so early hor... U c already then say...   ham
4  Nah I don't think he goes to usf, he lives aro...   ham 

                                                 sms class  binary_class
0  Go until jurong point, crazy.. Available only ...   ham             1
1                      Ok lar... Joking wif u oni...   ham             1
2  Free entry in 2 a wkly comp to win FA Cup fina...  spam             0
3  U dun say so early hor... U c already then say...   ham             1
4  Nah I don't think he goes to usf, he lives aro...   ham             1 

Number of missing values: 0


Let’s unpack what we’ve done. Here we use the method `.read_json()` to load the JSON data directly into a Pandas data frame named `data`.
Afterward, in `# TRANSFORM the data` we create a new column with the same notation used to create a new *key-value* pair in a dictionary: `dictionary["new_key"] = new_value` or `pandas_dataframe["new_column"] = new_series_of_values`.

To create the new series of values that will be assigned to our new column `binary_class`, we use the method `.apply()` on `data`.
This method applies a **function** iteratively over the **columns** or the **rows** of a Pandas data frame.

In the first parameter, we pass the function to `.apply()` with a syntax called **lambda function**.
Lambda functions behave like regular functions, but they have a shortened syntax limited to a single expression, which keeps simple operations short and easy to read.
The syntax of *lambda* functions is: `lambda input: computation`. Notice that you start with the keyword `lambda` to define the function, and the result you get for each row is the value returned by the `computation`.
In our case, the computation is `0 if row["class"] == "spam" else 1`, so our lambda function returns `0` when the row `class` is `spam`, and returns `1` in every other case, which in our dataset means that the `class` is `ham`.
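To compare both syntaxes side by side, here is the same computation written as a regular function and as a lambda function, applied to a made-up list of classes instead of a data frame. The names `to_binary` and `to_binary_lambda` are hypothetical.

```python
# A regular function...
def to_binary(sms_class):
    return 0 if sms_class == "spam" else 1

# ...and the equivalent lambda function.
to_binary_lambda = lambda sms_class: 0 if sms_class == "spam" else 1

classes = ["ham", "spam", "ham"]
with_def = [to_binary(sms_class) for sms_class in classes]
with_lambda = [to_binary_lambda(sms_class) for sms_class in classes]

# Careful: anything that is not "spam" becomes 1, even a missing value.
missing_becomes = to_binary_lambda(None)

print(with_def, with_lambda, missing_becomes)
```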

In the second parameter passed to `.apply()`, we indicate that we want to iterate over rows.
`.apply()` iterates over columns when its parameter `axis=0`, and iterates over rows when its parameter `axis=1`.
As we want to check the value of `class` in each row, we set `axis=1`.

And there you go, you now know how to process data with JSON, and scan any dataset for missing values. 😉
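One caveat about the cell above: because the lambda function returns `1` for everything that is not `spam`, a row without a class would silently become `1` and the missing-value check would still report zero. A possible alternative, sketched here on a made-up data frame and not the method used in this notebook, is the `.map()` method of a column with a dictionary. Values that are not in the dictionary become `NaN`, so `.isnull()` can find them.

```python
import pandas as pd

# Made-up data frame in which the third row has no class.
data = pd.DataFrame({
    "sms": ["Hello", "WIN a prize now", "No label here"],
    "class": ["ham", "spam", None],
})

# Classes that are not keys of the dictionary are mapped to NaN.
data["binary_class"] = data["class"].map({"spam": 0, "ham": 1})

nan_values = int(data["binary_class"].isnull().sum())
print(f"Number of missing values: {nan_values}")

# One option among several: drop the rows with a missing binary_class.
complete = data.dropna(subset=["binary_class"])
print(complete)
```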

#### 5.4.3 Write

Finally, our new data is ready, and the next step is to save it back as JSON. These are the methods for storing dictionaries and Pandas data frames:

In [12]:
# Save a dictionary as a JSON

import json


with open("../data/datasets/spam.json", "r") as file_handle:
    data = json.load(file_handle)

for index, item in enumerate(data):
    if item["class"] == "spam":
        data[index]["binary_class"] = 0
    elif item["class"] == "ham":
        data[index]["binary_class"] = 1
    else:
        pass

# SAVE the data:
with open("../data/datasets/spam_dictionary.json", "w") as file_handle:
    json.dump(data, file_handle, indent=2)

To keep this example focused on saving data, we skip the step of checking for missing values, but you should always perform it. The body of a flow control element (`for`, `if`, `else`, etc.) can’t be left empty; it must always contain at least one instruction. Therefore, we write `pass` to tell Python that we don’t want to do anything and simply continue to the next step in the code.

In [13]:
# Save a Pandas data frame as a JSON

import pandas as pd


data = pd.read_json("../data/datasets/spam.json")

data["binary_class"] = data.apply(
    lambda row: 0 if row["class"] == "spam" else 1,
    axis=1
)

# SAVE the data:
data.to_json("../data/datasets/spam_dataframe.json", indent=2, orient="records")

The last line shows the method to save a Pandas data frame as a JSON, and it has two additional parameters.
`indent=2` indents nested elements by two spaces to make the JSON file easier to read, and `orient="records"` tells Pandas to store the data as a list of records, with one JSON object per row (the same structure as our list of dictionaries).
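As with `.to_csv()`, calling `.to_json()` without a file path returns the text instead of saving it, which is handy for inspecting the format. The sketch below uses a made-up data frame and then decodes the text with `json.loads()` to confirm that the result is a list of dictionaries. It assumes a Pandas version that supports the `indent` parameter (1.0 or later).

```python
import json

import pandas as pd

# Made-up data frame with the same column names as our dataset.
data = pd.DataFrame({
    "sms": ["Hello", "WIN a prize now"],
    "class": ["ham", "spam"],
})

# Without a file path, .to_json() returns the JSON text.
json_text = data.to_json(orient="records", indent=2)
print(json_text)

# Decoding the text gives back one dictionary per row.
records = json.loads(json_text)
print(records)
```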

JSON is a universal translator between programming languages and you'll encounter it often when receiving and sending data. Now you are ready to start communicating with databases around the world! 😎

#### Exercise:
1. Open the JSON file in a simple text editor. Don't use Excel; instead, use a default text editor such as **TextEdit** on Mac or **Notepad** on Windows.
2. Open the [official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html) for `.to_json()` and find the alternative values that you can pass to the parameter `orient`.
3. Copy the code of the cell above into the cell below.
4. Save the data with all the different `orient` values (one line of code per value) and compare the output format in your text editor.
5. Run the cell to execute the code.