![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Module 3 Unit 3 - Cleaning and Filtering Data Sets

*This section contains tutorials! Get the most out of them by opening a Jupyter notebook in another window and following along. Code snippets provided in the course can be pasted directly into your Jupyter notebook. Review Module 2, Unit 5 for a refresher on creating and opening Jupyter notebooks in Callysto.*

The more accurate and representative our data set is, the more useful it is for data analysis. However, data sets often come with errors — mistakes made by people collecting or entering data, or caused by computer glitches when saving, copying, or transmitting data.

When doing data science, it's always a good idea to review our data and filter out faulty observations.

In this unit, we'll explore ways to:

* Select and view particular data in a DataFrame
* Add and remove rows
* Reorder rows
* Modify values
* Replace values
* Find outliers

The activities in this unit use a *coin_df* DataFrame similar to the one we created earlier in the course.

Create *coin_df* by running the code below.

    from pandas import DataFrame

    # Each key becomes a column name; each list holds that column's values
    data = {'name': ['penny', 'nickel', 'dime', 'quarter'],
            'value': [1, 5, 10, 25],
            'weight': [2.35, 3.95, 1.75, 4.4],
            'design': ['Maple Leaves', 'Beaver', 'Schooner', 'Caribou']}
    coin_df = DataFrame(data)
    coin_df  # the last line of a cell is displayed as output


![coin df](../_images/Module3-Unit4-image.png)

*A demonstration of how to show the output of a completed DataFrame, using Python programming. The Python code used to show the completed DataFrame about the penny, nickel, dime, and quarter was "coin_df."*


#### Activity: Common functions

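This table lists some common DataFrame operations and what each one looks like as a line of code. Try each one and see what output it produces. Note that the fifth operation changes the name in row 1 to 'NICKEL'; rerun the code above afterwards to restore the original DataFrame.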

| Operation                                         | Description                                                       |
|---------------------------------------------------|-------------------------------------------------------------------|
| `coin_df['name']`                                | Select (and print out) a column                                    |
| `coin_df[['name','weight']]`                     | Select two columns                                                 |
| `coin_df.loc[1]`                                 | Select a row                                                       |
| `coin_df.loc[1,'name']`                          | Select a single value in row/column                                |
| `coin_df.loc[1,'name'] = 'NICKEL'`               | Change a single data point in row/column                           |
| `coin_df[coin_df['weight']>3]`                   | Select all the rows where the weight of the coin is greater than 3 |
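
The operations in the table can also be tried together in one cell. The following is a minimal sketch that builds a separate copy of the coin data (named `coins` here, a hypothetical name chosen so that `coin_df` is left untouched) and stores the result of each selection in a variable.

```python
from pandas import DataFrame

# Hypothetical copy of the coin data, so coin_df itself is not changed
coins = DataFrame({'name': ['penny', 'nickel', 'dime', 'quarter'],
                   'value': [1, 5, 10, 25],
                   'weight': [2.35, 3.95, 1.75, 4.4],
                   'design': ['Maple Leaves', 'Beaver', 'Schooner', 'Caribou']})

names = coins['name']                    # one column (a Series)
two_columns = coins[['name', 'weight']]  # two columns (a DataFrame)
second_row = coins.loc[1]                # the row whose index label is 1
one_value = coins.loc[1, 'name']         # a single value from row 1
heavy = coins[coins['weight'] > 3]       # rows where weight is greater than 3

print(one_value)
print(heavy)
```

Storing each result in a variable makes it possible to print or reuse it later in the notebook.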


#### Adding another row to the DataFrame

If we want to expand our data set, we can use `loc` to add a new row by specifying its index and its values, one value per column. Let's try that now.

Run the code below in your Jupyter notebook to add another row to the coin_df DataFrame.

    coin_df.loc[4] = ['50-cent piece', 50, 6.9, 'Coat of Arms']
    coin_df
    
The outcome should look something like this:

![coin df 2](../_images/Module3-Unit4-image1.png)

*A DataFrame containing data on the value, weight and design of various coins. The first five include 'penny', 'nickel', 'dime', 'quarter', '50-cent piece'. Each row contains information for each of these coins.*

In this example, we set the row index for the new coin to 4, but when adding a row we can use nearly any unused number. For example, we could make the row index 7, 42, 5000, -71, or 3.14159 (pi to five decimal places).
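
As a minimal sketch of this idea, the code below adds the same row under the index label 42. The label 42 is an arbitrary choice, and `coins` is a hypothetical separate copy of the data used only for this illustration.

```python
from pandas import DataFrame

# Hypothetical copy of the coin data, so coin_df itself is not changed
coins = DataFrame({'name': ['penny', 'nickel', 'dime', 'quarter'],
                   'value': [1, 5, 10, 25],
                   'weight': [2.35, 3.95, 1.75, 4.4],
                   'design': ['Maple Leaves', 'Beaver', 'Schooner', 'Caribou']})

# Add a row under an unused index label; 42 is an arbitrary choice
coins.loc[42] = ['50-cent piece', 50, 6.9, 'Coat of Arms']

print(list(coins.index))
print(coins.loc[42, 'name'])
```

If the label is already in use, the same statement overwrites that row instead of adding a new one, so it is worth checking the existing index first.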


#### Dropping a row from the DataFrame

Conversely, if we want to remove some data from our DataFrame, we can use the drop function. For instance, the following line of code will remove the row with index 0.

    coin_df.drop(index=0)
    
![coin df 3](../_images/Module3-Unit4-image2.png)

*A DataFrame containing data on the value, weight and design of various coins after the 'penny' row has been removed. The first three include 'nickel', 'dime', 'quarter'. Each row contains information for each of these coins.*

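Notice that this is actually a new DataFrame with one fewer row than coin_df; coin_df itself is unchanged. If you want to change coin_df itself, use the inplace option. The later examples in this unit still include the penny row, so recreate coin_df afterwards if you try this: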

    coin_df.drop(index=0, inplace=True)
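
The difference between the two forms of `drop` can be checked directly. The following is a small sketch using a hypothetical two-column copy of the data named `coins`; it also shows that assigning the result back to the same name is another way to keep the change.

```python
from pandas import DataFrame

# Hypothetical shortened copy of the coin data
coins = DataFrame({'name': ['penny', 'nickel', 'dime', 'quarter'],
                   'value': [1, 5, 10, 25]})

# drop returns a new DataFrame; coins still has all four rows
without_penny = coins.drop(index=0)
print(len(coins), len(without_penny))

# Assigning the result back to the same name keeps the change,
# which has the same effect as inplace=True
coins = coins.drop(index=0)
print(list(coins['name']))
print(list(coins.index))
```

Note that the remaining rows keep their original index labels, so the index now starts at 1 rather than 0.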
    
    
#### Reordering rows in the DataFrame

Sometimes it's helpful to order rows in a data set according to the data, rather than their default row index number. For example, we might want to display our coin data in alphabetical order.

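The **reindex** function lets us specify a particular row order by listing the row index labels in the order we want. Rows whose labels are not listed are left out of the result.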

    coin_df.reindex([2,1,0,3])
    
Try this in your own notebook now. The outcome should look like this:

![coin df 4](../_images/Module3-Unit4-image3.png)

*Demonstration of sorting coins DataFrame by name, using alphabetical order. Order of rows changes to reflect: data for dime, then data for nickel, then data for penny, then data for quarter. Note that data for 50-cent piece does not appear here.*
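
With `reindex`, we have to work out the alphabetical order ourselves. As a hedged alternative, pandas also provides `sort_values`, which orders the rows by the contents of a column. The sketch below compares the two on a hypothetical shortened copy of the data named `coins`.

```python
from pandas import DataFrame

# Hypothetical shortened copy of the coin data, including the 50-cent piece
coins = DataFrame({'name': ['penny', 'nickel', 'dime', 'quarter', '50-cent piece'],
                   'value': [1, 5, 10, 25, 50]})

# reindex keeps only the labels we list, in the order we list them
reordered = coins.reindex([2, 1, 0, 3])
print(list(reordered['name']))

# sort_values orders every row by the contents of the chosen column
by_name = coins.sort_values('name')
print(list(by_name['name']))
```

Unlike the `reindex` call, `sort_values` keeps all five rows. The 50-cent piece comes first because digits sort before letters.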

#### Modifying values in the DataFrame

Earlier in the course, we explored how to modify an entire column of entries at once by applying a simple mathematical formula, similar to this one:

    coin_df['value'] = coin_df['value']/100
    coin_df
    
![coin df 5](../_images/Module3-Unit4-image4.png)

*Demonstration of dividing every entry in the value column by 100, converting cents to dollars. The value is now 0.01 for penny, 0.05 for nickel, 0.10 for dime, 0.25 for quarter, and 0.50 for 50-cent piece.*
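
One thing to watch for: because the division stores its result back in the same column, running that cell a second time divides the values by 100 again. A minimal sketch of one way to avoid this, assuming we are happy to add a new column (the name `value_dollars` is hypothetical), is to leave the original column untouched.

```python
from pandas import DataFrame

# Hypothetical shortened copy of the coin data
coins = DataFrame({'name': ['penny', 'nickel', 'dime', 'quarter'],
                   'value': [1, 5, 10, 25]})

# Store the converted values in a new column; 'value' stays in cents
coins['value_dollars'] = coins['value'] / 100
print(coins)
```

Rerunning this cell always gives the same result, because the calculation starts from the unchanged `value` column each time.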

This method is good for a bulk modification to numbers, but what if we want to modify text values, also known as strings?

#### 🏷️ Key Term: String
>In Python, a string is a sequence of characters, written between quotation marks. Text values in a data set, such as names or labels, are stored as strings.

For this we can use the **map** function, which applies a function to every value in a column.

For example, right now all the text in our DataFrame is lowercase, except for the values under the design column. The following command shows what those values look like in lowercase.

    coin_df['design'].map(str.lower)
    
![coin df 6](../_images/Module3-Unit4-image5.png)

*Demonstration of turning all words under the design column into lower case. Sorted by value. Printed on screen: 0 (row index) maple leaves, 1 (row index) beaver, 2 (row index) schooner, 3 (row index) caribou, 4 (row index) coat of arms. Name of column: design. Type object.*

Remember, this doesn’t change the original DataFrame. It just outputs the result. If you are happy with this result, you can store it back in the DataFrame, like this:

    coin_df['design'] = coin_df['design'].map(str.lower)
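
The same pattern works with other built-in string functions. The following is a small sketch, again on a hypothetical copy named `coins`, that applies `str.upper` to the names and also shows the pandas `.str` accessor as an alternative way to get the lowercase designs.

```python
from pandas import DataFrame

# Hypothetical shortened copy of the coin data
coins = DataFrame({'name': ['penny', 'nickel', 'dime', 'quarter'],
                   'design': ['Maple Leaves', 'Beaver', 'Schooner', 'Caribou']})

# map applies the given function to every value in the column
upper_names = coins['name'].map(str.upper)
lower_designs = coins['design'].map(str.lower)

# pandas also offers string methods through the .str accessor
lower_designs_alt = coins['design'].str.lower()

print(list(upper_names))
print(list(lower_designs))
```

Neither call changes `coins`; each one returns a new column that we can inspect before deciding whether to store it.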


#### Lambda functions (Optional)

Lambda functions let us write a small function right where it is needed, for example as the parameter passed to another function such as **map**.

For instance, suppose we needed a function that would let us double all the weight values in our data set.

We could start by defining a function called *Doubler* and pass it to the map function, like this:

    def Doubler(x):
        return x+x
    coin_df['weight'].map(Doubler)
    
However, a more succinct way is to represent the Doubler function in the call to map, like this:

    coin_df['weight'].map(lambda x: x+x)
    

This way, the doubling function is written directly inside the call to **map** as a parameter, using lambda notation.

We call this type of function an **anonymous function,** because we never define it with a specific name.
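
To check that the named and anonymous versions behave the same way, we can run both and compare the results. This is a minimal sketch on a hypothetical copy of the weights; the function name `doubler` is lowercase here only to follow common Python naming style.

```python
from pandas import DataFrame

# Hypothetical shortened copy of the coin data
coins = DataFrame({'name': ['penny', 'nickel', 'dime', 'quarter'],
                   'weight': [2.35, 3.95, 1.75, 4.4]})

def doubler(x):
    # Named function: returns twice its input
    return x + x

with_named = coins['weight'].map(doubler)
with_lambda = coins['weight'].map(lambda x: x + x)  # same logic, no name

# Compare the two results value by value
print(list(with_named) == list(with_lambda))
```

A named function is often easier to read when the logic takes more than one line, while a lambda suits short, one-off calculations like this one.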

In a more useful example, we might like to specify a numerical function that converts units for the weights.

The lambda function defined via the statement:

    lambda x : x*28.35
    
will convert ounces to grams. We can apply this to the weight column, with the **map** function calling up the lambda function:

![coin df 7](../_images/Module3-Unit4-image6.png)

*Demonstration of changing the values in the weight column using a lambda function. Using the .map() method, we can change the weight shown in each row as follows: coin_df['weight'].map(lambda x: x*28.35). This will multiply each value under the 'weight' column by 28.35.*
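
The call described in the caption can be written out as runnable code. The sketch below uses hypothetical weights recorded in ounces (the coin names and values are made up for illustration) so that the conversion to grams is easy to check by hand.

```python
from pandas import DataFrame

# Hypothetical weights recorded in ounces, for illustration only
coins_oz = DataFrame({'name': ['coin A', 'coin B'],
                      'weight': [1.0, 2.0]})

# Multiply each weight by 28.35 to convert ounces to grams
weight_in_grams = coins_oz['weight'].map(lambda x: x * 28.35)
print(weight_in_grams)
```

As with the earlier examples, this only displays the converted values. To keep them, assign the result to a column.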



##### 🏁 Activity: ?





### Conclusion

In this unit, we showed how to access data inside a DataFrame, add and remove rows, reorder rows, and modify values, both by applying simple calculations to a whole column and by using the **map** function.

The next unit will address more complex ways to manipulate data in a DataFrame, which are useful when cleaning and filtering our data.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)