# An Introduction to Cleaning Your Data: Explore and Experiment with DataFrames

Now that you've learned the basics of Python in Codecademy, we are going to explore and experiment in this Jupyter Notebook.  
I've laid out 7 steps that you'll need to clean data; this is a basic structure so that you learn the various functions that you'll need later on.  
The goal here is to mess around with all of these functions and learn through trial-and-error.

To clean the data we need to:
1. **Get our notebook ready** to clean the data.
2. **Import** the raw data.
3. Understand **what the data are** and **how the data are stored**.
4. **Change data** that are not stored in the way that best fits our needs.
5. **Delete data** that we do not want.
6. **Create data** that we need.
7. **Export** the cleaned data. 

## 1. Get Our Notebook Ready

We always import packages at the top of our script.  
For now, the only packages that we need are `pandas`, `numpy`, and `os`.  
- pandas will help us interact with our data as a table; people call tables in pandas **DataFrames**
- numpy is our math package; this will help us alter and create data
- os helps us define filepaths and load data into our notebook

**To import each of these packages**, you want to run all three lines of code within the cell.  
To run these lines of code, select anywhere in the cell and hit `shift + enter` (you don't need to highlight the code).  
As the code runs, you'll see an asterisk [*] appear in the brackets to the left of the cell, and once it is done, you'll see a number [1].   
That number means that the cell ran all three lines of code and all of those packages are imported into your notebook.

In [1]:
import pandas as pd
import numpy as np
import os

**Notice** that I imported `pandas as pd` and `numpy as np`; I am just changing the name that I use to refer to the package to save time later.

---
Quick tangent on interacting with Jupyter Notebooks:  
1. To add a cell below this one, click on this cell, and hit `b`.
2. Click on your new cell and hit `a` to add a cell above it; if you just typed an `a` in the cell, rather than adding a new cell, hit `esc` and then hit `a`
3. To delete these new cells, click the cell and hit `dd` (`d` twice)

Create a new cell and turn it into a Markdown cell by pressing `m`. Markdown is a syntax to create formatted text.  
Type some stuff out and run the cell (`shift + enter`); here is a [link](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) for info on how to format Markdown text.  
Now click on the cell and turn it back into a code cell by pressing `y`.

---

## 2. Import the Raw Data

To import the raw data, we need to tell JupyterLab where to find it.  
To direct JupyterLab to the data file, we need to define a file path.  
To define a file path, we need to know how to interact with our packages.  
- `os` is a **package**; packages are a collection of **modules**.    
- `path` is a **module** within the package; modules are a collection of **functions**.  
- `abspath()` is a **function** within the path module, within the os package.  
Notice the parentheses at the end of `abspath()` &rarr; this means it is a function; functions take an input and will do something or create something.

`os.path.abspath()` with `''` as an input will tell me what file directory I am currently working in.

**To figure out what file directory (or folder) we are currently working in...**
1. Click on the cell below.
2. Type `os.path.abspath('')` into the cell.
3. Hit `shift + enter` to run the line of code

This prints out the file directory that we are currently working in to JupyterLab.  
If we look in our finder, we can see that this filepath refers to where our Jupyter Notebook file is saved.

If we print something out to JupyterLab, it isn't saved anywhere.  
I want to save the filepath to refer back to later.  

**To save the filepath...**
1. Type `current_directory = os.path.abspath('')`
2. Hit `shift + enter` to run the line of code

Notice that nothing prints out into JupyterLab; let's print out the data again to make sure it saved correctly.

**To see if it saved correctly...**
1. Type `current_directory` and run the code

---
Now the data that we created using the **function** `os.path.abspath()` given the **input** `''`, is saved in the **variable** `current_directory`.

---
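Here is a minimal sketch of the steps above in one cell, in case you want to compare it against what you typed. The only name in it is `current_directory`, the same variable name used in the steps.

```python
import os

# Ask os for the folder we are currently working in and save the answer
current_directory = os.path.abspath('')

# Nothing prints when we save to a variable, so print it to check what was saved
print(current_directory)

# The saved filepath is just text (a string)
print(type(current_directory))
```

In a notebook, typing `current_directory` on the last line of a cell shows the value too; `print()` works anywhere in the cell.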

Next, we want to figure out where the data are **relative** to our current directory. These are called **relative paths**; defining filepaths relative to where your notebook is saved is really important so that other people can run your code.  
Imagine opening your finder and locating this Jupyter Notebook file; when we create a **relative path**, it is like giving someone instructions on how to navigate from this Jupyter Notebook file to the raw data file.  

**First, we need to go up to the GitHub folder that our clean and data folders are in by using `dirname()`; then we'll specify where the data file is from the GitHub folder using `join()`.**  
1. Create the variable `analysis_directory` by running `analysis_directory = os.path.dirname(current_directory)`
2. Create the variable `data_filepath` by running `data_filepath = os.path.join(analysis_directory, 'data', 'raw_data', 'YOUR FILE NAME')`
    - YOUR FILE NAME should look something like `survey_data.xlsx` or `data_from_study.csv`

3. Print the `data_filepath` to make sure that the filepath was specified correctly.
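The three steps above might look something like this when put together. This is only a sketch: `survey_data.csv` is a made-up file name, and the `data` and `raw_data` folder names are taken from the step above, so swap in whatever matches your own folders.

```python
import os

# Where the notebook is running
current_directory = os.path.abspath('')

# dirname() moves up one level, to the folder that contains the current folder
analysis_directory = os.path.dirname(current_directory)

# join() glues folder names and the file name together with the right separator
# NOTE: 'survey_data.csv' is a made-up file name; use your own
data_filepath = os.path.join(analysis_directory, 'data', 'raw_data', 'survey_data.csv')

print(data_filepath)
```

Using `join()` rather than typing slashes by hand means the same code works on both Mac and Windows.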

Now that we have defined where the data are, we are going to use `pandas` to load the data into our notebook.
If you have an Excel file, you are going to use the `read_excel()` function; if you have a CSV file, you are going to use the `read_csv()` function in `pandas`.  
These read functions take the `data_filepath` as the input.

**To import your data in JupyterLab...**
1. Type `raw_data = pd.read_csv(data_filepath)`
    - swap out `read_csv` with `read_excel` if needed
2. Print the data to make sure that it loaded correctly
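If you want to see `read_csv()` work before pointing it at your own file, here is a small self-contained sketch. It writes a tiny made-up CSV file to a temporary folder first; the file name, column names, and values are all invented for the example.

```python
import os
import tempfile

import pandas as pd

# Write a tiny made-up CSV file so this example does not depend on your own data
example_folder = tempfile.mkdtemp()
example_filepath = os.path.join(example_folder, 'example_data.csv')
with open(example_filepath, 'w') as f:
    f.write('participant,age,outcome_score\n')
    f.write('p01,34,2.5\n')
    f.write('p02,29,4.0\n')

# read_csv() takes the filepath as its input and gives back a DataFrame
example_raw_data = pd.read_csv(example_filepath)

print(example_raw_data)        # the whole table
print(example_raw_data.shape)  # (number of rows, number of columns)
```

The first line of the file becomes the column names, and every line after that becomes a row.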

Now you have your data in JupyterLab! It is stored as a pandas **[DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)**.  
DataFrames have an **[Index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html)**; the indices are the row labels (row numbers by default) and the column names.  
DataFrames are made up of rows and columns; each row or each column by itself is a **[Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)**.  
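To make those three words a bit more concrete, here is a small sketch using a made-up table (`example_df` and its columns are invented for the example).

```python
import pandas as pd

# A tiny made-up table
example_df = pd.DataFrame({'age': [34, 29], 'outcome_score': [2.5, 4.0]})

print(type(example_df))         # the whole table is a DataFrame
print(example_df.index)         # the row labels
print(example_df.columns)       # the column names
print(type(example_df['age']))  # one column by itself is a Series
print(type(example_df.loc[0]))  # one row by itself is also a Series
```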

**Let's save a copy of the data as `df` (short for DataFrame)**; this way, we'll always be able to refer back to the raw data without loading it back in from the file.  
Run the line of code below and from here on, we'll only mess around with our data as `df`.

In [None]:
df = raw_data.copy()
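If you are curious why we bother with `.copy()`, this sketch shows the idea on a made-up table: changes made to the copy should not touch the original. The names `example_raw` and `example_working` are invented for the example.

```python
import pandas as pd

# A tiny made-up "raw" table and a working copy of it
example_raw = pd.DataFrame({'age': [34, 29]})
example_working = example_raw.copy()

# Change a value in the working copy only
example_working.loc[0, 'age'] = 99

print(example_working.loc[0, 'age'])  # the changed value
print(example_raw.loc[0, 'age'])      # the original value is untouched
```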

## 3. What Are the Data & How Are They Stored
---
**Now that we are here, a quick note on error messages:**  
You will make *a lot* of errors and deciphering what went wrong will be confusing.  
When you get an error message, it will print out this long list of gibberish.  
Most of it is useless; the information that you want is:
1. Which line of code (that you wrote) is not working?
2. Scroll to the bottom of the error message...what is the error code?

Error codes won't be super useful when you start out, but they'll start to make sense as you read documentation and make similar mistakes over and over again :).
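As one example of what the bottom of an error message is telling you, this sketch asks for a column that does not exist in a made-up table. Normally the error would stop the cell; here it is caught with `try`/`except` so that we can print just the two useful parts.

```python
import pandas as pd

# A tiny made-up table with only one column
example_df = pd.DataFrame({'age': [34, 29]})

try:
    example_df['height']  # this column does not exist
except KeyError as error:
    error_name = type(error).__name__
    print(error_name)  # the kind of error (what you see at the bottom of the message)
    print(error)       # the detail: which key pandas could not find
```

A `KeyError` on a DataFrame usually means that a row or column name was misspelled or is not in the data.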

---

**The goal of this section** is to use a bunch of different functions to grab chunks of your data from your DataFrame.  
To change, delete, or manipulate data you'll need to be able to select portions of it.

For each of the data types:
1. Using the functions below, select and print a chunk of data into JupyterLab
2. To confirm the type of data that you've selected, run `type(your data selection)`

Data types to select:
- DataFrame
- Series
- String
- Integer
- Float

**These links will take you to documentation that provides more info about the function**  
[df.loc[]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html), 
[df.iloc[]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)  
I'd recommend looking at the examples (at the bottom) first.  
Then scroll back up to the top and learn what information goes into the function, what comes out, and what types of errors will come out if you make a mistake.  
The blue boxes contain functions that are related/do similar things.

Key functions:
- **Indexing** Select specific chunks using the indices [can return a single value, Series, or DataFrame]:
    - [`df.loc[name of your row, name of your column]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)
    - [`df.iloc[index of your row, index of your column]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html)
- Select a column [returns a Series]:
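    - `df.your_column` or `df['your_column']` (either syntax works when the column name has no spaces and doesn't clash with something pandas already defines, like `count`; `df['your_column']` always works)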
- Select multiple columns [returns a DataFrame]:
    - `df[your list of columns]` where the list of columns:
        - Can name columns explicitly ['column1', 'column2']
        - Or use **list comprehension** to choose columns using a conditional like `[i for i in df.columns if 'outcome' in i]`
- **Masking** Creating a mask [returns a DataFrame] like `df[conditional statement]` where conditional statement can be things like:
    - `df.your_column == your value`
    - `df.your_column != your value`
    - `df.your_column.isin([your list])`
    - `df.your_column.notnull()`
    - `df.your_column.isnull()`
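Here is a sketch that tries each of the selection tools above on a small made-up table; every name and value in it is invented, so treat it as a starting point to mess around with rather than the one right way to do it.

```python
import pandas as pd

# A tiny made-up table with named rows
example_df = pd.DataFrame(
    {
        'age': [34, 29, 41],
        'outcome_a': [2.5, 4.0, None],
        'outcome_b': [1.0, 3.5, 2.0],
        'group': ['control', 'treatment', 'control'],
    },
    index=['p01', 'p02', 'p03'],
)

# Indexing: by name (.loc) or by position (.iloc)
single_value = example_df.loc['p01', 'age']
same_value = example_df.iloc[0, 0]

# One column is a Series
one_column = example_df['age']

# Several columns are a DataFrame; here the list is built with a list comprehension
outcome_columns = [i for i in example_df.columns if 'outcome' in i]
several_columns = example_df[outcome_columns]

# Masking: keep only the rows where the conditional is True
control_rows = example_df[example_df['group'] == 'control']
complete_rows = example_df[example_df['outcome_a'].notnull()]

for selection in [single_value, one_column, several_columns, control_rows]:
    print(type(selection))
```

When you check a single number with `type()`, pandas will usually report a numpy type such as `numpy.int64` or `numpy.float64`; for our purposes these behave like Python integers and floats.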
    
---
**Notice** which of these tools use brackets `[]` rather than parentheses `()`; tools that slice/select data (like `.loc[]` and `.iloc[]`) use brackets, while **functions** use parentheses &rarr; check out this [link](https://lerner.co.il/2018/06/08/python-parentheses-primer/) for more detailed info on when to use `[]` vs. `()` vs. `{}`

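**Notice** that these functions are called right on the data, like `df.rename()`, rather than being called from a module, like `os.path.abspath()`. Functions that are attached to the data and work directly on it are called **methods**; DataFrames also have **attributes**, which are stored pieces of information like `df.shape` or `df.columns` (no parentheses).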
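A quick sketch of that difference, using a made-up table and a made-up file path: one function is called from a module, one is called right on the data (pandas documentation calls these methods), and one piece of information is looked up without any parentheses at all.

```python
import os

import pandas as pd

# A tiny made-up table
example_df = pd.DataFrame({'age': [34, 29, 41]})

# Called from a module: package.module.function(input)
file_name = os.path.basename('data/raw_data/survey_data.csv')

# Called right on the data: data.function(input)
first_rows = example_df.head(2)

# Looked up on the data without parentheses: a stored value, not a function
table_shape = example_df.shape

print(file_name)
print(first_rows)
print(table_shape)
```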

---

## 4. Change Data
**The goal of this section** is to use the functions/techniques below to change the data in your DataFrame.  
Try to use each of the functions below to change a chunk of data in your DataFrame.

**These links will take you to documentation that provides more info about the function**  
[df.rename()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html), 
[df.replace()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html), 
[df.your_column.map()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html)

Key functions:
- [`df.rename()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html)
- [`df.replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html)
- [`df.your_column.map()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html)
- `df.loc[your row_name, your column_name] = your value`
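Here is a sketch of each of the four techniques above on a small made-up table. The column names, the `-99` code, and the abbreviations are all invented for the example.

```python
import pandas as pd

# A tiny made-up table with a few things we want to change
example_df = pd.DataFrame(
    {
        'Age': [34, 29, 41],
        'group': ['ctrl', 'trt', 'ctrl'],
        'score': [1, 2, -99],
    }
)

# rename(): change a column name using {old name: new name}
example_df = example_df.rename(columns={'Age': 'age'})

# replace(): in the 'score' column, swap the value -99 for 0
example_df = example_df.replace({'score': {-99: 0}})

# map(): translate every value in one column using {old value: new value}
example_df['group'] = example_df['group'].map({'ctrl': 'control', 'trt': 'treatment'})

# .loc[] with =: overwrite one specific cell
example_df.loc[0, 'age'] = 35

print(example_df)
```

One thing to watch for with `map()`: any value that is missing from the dictionary ends up as a missing value (`NaN`) in the result, whereas `replace()` leaves values it doesn't recognize alone.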

## 5. Delete Data
**The goal of this section** is to use the functions/techniques below to delete data in your DataFrame.  
Try to use each of the functions below to remove a chunk of data from your DataFrame.

**These links will take you to documentation that provides more info about the function**  
[df.drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

Key functions:
- [`df.drop()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
- Save the DataFrame as only a portion of the data using a mask or indexing
    - `df = df[df.your_column > your value]`
    - `df = df.loc[:, ['column1', 'column2']]` &rarr; `:` returns all rows and you can use a list of columns with the `.loc[]` indexing function
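The sketch below walks through each of these ways of removing data on a made-up table, one after another, so you can print `example_df` after any step and watch it shrink. All of the names and values are invented for the example.

```python
import pandas as pd

# A tiny made-up table
example_df = pd.DataFrame(
    {
        'age': [34, 29, 41],
        'score': [1, 2, 3],
        'notes': ['a', 'b', 'c'],
    }
)

# drop(): remove a column by name
example_df = example_df.drop(columns=['notes'])

# drop(): remove a row by its index label
example_df = example_df.drop(index=[0])

# Mask: keep only the rows where the conditional is True
example_df = example_df[example_df['age'] > 30]

# Indexing: keep all rows (:) but only the listed columns
example_df = example_df.loc[:, ['age']]

print(example_df)
```

Notice that every line saves the result back into `example_df`; without the `example_df =` part, the smaller table would print out but nothing would actually be deleted.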

## 6. Create Data
**The goal of this section** is to use the functions/techniques below to create/add data to your DataFrame.  
Try to use each of the functions below to add new data to your DataFrame.

**These links will take you to documentation that provides more info about the function**  
[np.where()](https://numpy.org/doc/stable/reference/generated/numpy.where.html)

Key functions (click links for info):
- `df['new column name'] = your value or some function that creates data` where functions to create data include:
    - [`np.where()`](https://numpy.org/doc/stable/reference/generated/numpy.where.html)
    - list comprehensions:
        - `[i for i in range(len(df))]` &rarr; creates a list of values from 0 to n - 1, where n is the length of your DataFrame
        - `[i if i > 1 else i*10 for i in df.your_column]` &rarr; creates a list of values based on another column and a conditional
        - `[df.loc[idx, 'column1'] if df.loc[idx, 'column0'] == True else df.loc[idx, 'column2'] for idx in df.index]` &rarr; creates a list that combines the values from two columns based on a conditional
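Here is a sketch of these ways to create a new column, using a small made-up table. The column names are invented (with `use_first` standing in for the True/False column), and `np.where()` is used for the "combine two columns" case as an alternative to the longer list comprehension.

```python
import numpy as np
import pandas as pd

# A tiny made-up table
example_df = pd.DataFrame(
    {
        'use_first': [True, False, True],
        'column1': [1, 2, 3],
        'column2': [10, 20, 30],
    }
)

# A list comprehension that counts the rows: [0, 1, 2]
example_df['row_number'] = [i for i in range(len(example_df))]

# A list comprehension with a conditional based on another column: [10, 2, 3]
example_df['scaled'] = [i if i > 1 else i * 10 for i in example_df['column1']]

# np.where(condition, value if True, value if False): [1, 20, 3]
example_df['combined'] = np.where(
    example_df['use_first'], example_df['column1'], example_df['column2']
)

print(example_df)
```

`np.where()` checks the condition for every row at once, so it tends to be shorter to write than looping over `df.index` yourself.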

## 7. Export the Cleaned Data
Last, we'll export the "cleaned data" (here it might just be changed in a bunch of ways) to files in the finder.  
This is standardized code that we'll use later; here if you type in the date and the name of some scale and run the code, you'll see how everything saves.  

In [None]:
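# ---- date in the form year, month, day (yyyymmdd), typed as a string in quotes
date = 'yyyymmdd'  # placeholder: replace with the actual date
# ---- abbreviation of the scale (e.g. CEBQ), typed as a string in quotes
scale = 'CEBQ'  # placeholder: replace with your scale

# ---- folder that the cleaned data are saved in (this folder must already exist)
cleaned_directory = os.path.join(os.path.dirname(os.path.abspath('')), 'cleaned_data')

# ---- export to excel
df.to_excel(os.path.join(cleaned_directory, '{}_cleaned_{}.xlsx'.format(date, scale)))

# ---- export to open source format
df.to_hdf(os.path.join(cleaned_directory, '{}_cleaned_{}.h5'.format(date, scale)), key='data')

del date, scale, cleaned_directory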
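If you'd like to try exporting without touching your real folders, here is a small sketch that saves a made-up table to a temporary folder and loads it back in to check it. To keep it simple it uses `to_csv()` instead of the Excel and HDF exports above, and the date and table are invented; `os.makedirs()` is included because pandas will not create a missing folder for you.

```python
import os
import tempfile

import pandas as pd

# A tiny made-up "cleaned" table
example_df = pd.DataFrame({'age': [34, 29]})

# Made-up values for the file name
date = '20240101'
scale = 'CEBQ'

# Create the output folder if it does not exist yet (inside a temporary folder here)
output_folder = os.path.join(tempfile.mkdtemp(), 'cleaned_data')
os.makedirs(output_folder, exist_ok=True)

# Build the file name from the date and the scale, then save
output_filepath = os.path.join(output_folder, '{}_cleaned_{}.csv'.format(date, scale))
example_df.to_csv(output_filepath, index=False)

# Load the file back in to check that it saved the way we expected
reloaded = pd.read_csv(output_filepath)
print(output_filepath)
print(reloaded)
```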