# Pandas Data Cleaning

import pandas as pd

## Review: Independent Exercise

Load the `drinks.csv` data.  

Perform the following:  

1. Print the head and tail.
2. Look at the index, columns, dtypes, and shape.
3. Assign the beer_servings column/Series to a variable.
4. Calculate summary statistics for beer_servings.
5. Calculate the mean of beer_servings.
6. Count the values of unique categories in continent. (Use `.value_counts()`.)
7. Print the dimensions of the drinks DataFrame.
8. Find the first three items of the value counts of the column of your choice.

## Changing data types

#### Check the datatypes of the dataframe

#### Change the datatype of the `beer_servings` column to floating point
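A minimal sketch of both steps. The small DataFrame below is a hypothetical stand-in for `drinks.csv` with illustrative values, so the example runs without the file.

```python
import pandas as pd

# Hypothetical stand-in for drinks.csv (values are illustrative only).
drinks = pd.DataFrame({
    "country": ["Albania", "France", "Namibia"],
    "beer_servings": [89, 127, 376],
    "continent": ["EU", "EU", "AF"],
})

# Inspect the dtype of every column.
print(drinks.dtypes)

# astype returns a new Series, so assign it back to the column.
drinks["beer_servings"] = drinks["beer_servings"].astype(float)
print(drinks["beer_servings"].dtype)
```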

## Filtering and Sorting DataFrame

#### Filter drinks to include only European countries.

First, we create a Series of Booleans.

Then we use this Series to filter the DataFrame. (This is why `drinks` appears twice in the expression.)
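The two steps might look like this, assuming the continent code for Europe is `"EU"` and using a small made-up stand-in for the `drinks` data.

```python
import pandas as pd

# Hypothetical stand-in for drinks.csv (values are illustrative only).
drinks = pd.DataFrame({
    "country": ["Albania", "France", "Namibia", "Japan"],
    "wine_servings": [54, 370, 1, 16],
    "continent": ["EU", "EU", "AF", "AS"],
})

# Step 1: a Boolean Series, True where the row is European.
is_european = drinks["continent"] == "EU"

# Step 2: pass the Boolean Series to the DataFrame to keep only the True rows.
european = drinks[is_european]

# The usual one-liner combines both steps.
european_one_liner = drinks[drinks["continent"] == "EU"]
print(european)
```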

#### Filter drinks to include only European countries with wine_servings > 300.

#### Filter drinks to include only countries with wine_servings > 300 or beer_servings > 300.
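A sketch of combining conditions, again on made-up rows. Pandas uses `&` for "and" and `|` for "or", and each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

# Hypothetical stand-in for drinks.csv (values are illustrative only).
drinks = pd.DataFrame({
    "country": ["Albania", "France", "Namibia", "Japan"],
    "beer_servings": [89, 127, 376, 77],
    "wine_servings": [54, 370, 1, 16],
    "continent": ["EU", "EU", "AF", "AS"],
})

# AND: European countries with wine_servings > 300.
eu_wine = drinks[(drinks["continent"] == "EU") & (drinks["wine_servings"] > 300)]

# OR: countries with wine_servings > 300 or beer_servings > 300.
wine_or_beer = drinks[(drinks["wine_servings"] > 300) | (drinks["beer_servings"] > 300)]

print(eu_wine)
print(wine_or_beer)
```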

#### If we find ourselves chaining many "OR" conditions on the same column, we can use `.isin` to build the Boolean Series that filters the DataFrame
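For comparison, here is the same filter written with chained `|` conditions and with `.isin`, on a hypothetical stand-in for the data.

```python
import pandas as pd

# Hypothetical stand-in for drinks.csv (values are illustrative only).
drinks = pd.DataFrame({
    "country": ["Albania", "Namibia", "Japan", "Fiji"],
    "continent": ["EU", "AF", "AS", "OC"],
})

# Chained OR conditions grow quickly as more values are added.
with_or = drinks[(drinks["continent"] == "EU") | (drinks["continent"] == "AF")]

# .isin takes a list of accepted values and returns a Boolean Series.
with_isin = drinks[drinks["continent"].isin(["EU", "AF"])]

print(with_isin)
```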

#### Calculate the mean beer_servings for all of Europe.
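One possible approach uses `.loc` to select the European rows and the `beer_servings` column in a single step. The data is a made-up stand-in.

```python
import pandas as pd

# Hypothetical stand-in for drinks.csv (values are illustrative only).
drinks = pd.DataFrame({
    "country": ["Albania", "France", "Namibia"],
    "beer_servings": [89, 127, 376],
    "continent": ["EU", "EU", "AF"],
})

# Rows where continent is "EU", column beer_servings, then the mean.
europe_beer_mean = drinks.loc[drinks["continent"] == "EU", "beer_servings"].mean()
print(europe_beer_mean)
```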

#### Determine which 10 countries have the highest total_litres_of_pure_alcohol.

#### Which 10 countries have the lowest total_litres_of_pure_alcohol?

Side note: This does not change the underlying data. How can we change the underlying data?
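A sketch of sorting in both directions, and of one way to answer the side note. The rows are illustrative, and `.head(2)` stands in for the `.head(10)` you would use on the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for drinks.csv (values are illustrative only).
drinks = pd.DataFrame({
    "country": ["Albania", "France", "Namibia", "Japan"],
    "total_litres_of_pure_alcohol": [4.9, 11.8, 6.8, 7.0],
})
col = "total_litres_of_pure_alcohol"

# Highest first: sort descending, then take the top rows.
highest = drinks.sort_values(col, ascending=False).head(2)

# Lowest first: the default sort order is ascending.
lowest = drinks.sort_values(col).head(2)

# sort_values returned new DataFrames, so drinks is still in its original order.
order_before = drinks["country"].tolist()

# To change the underlying data, assign the result back (or pass inplace=True).
drinks = drinks.sort_values(col)
order_after = drinks["country"].tolist()
```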

#### Let's sort by multiple columns. First sort by `beer_servings` then by `wine_servings`.
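Passing a list to `sort_values` sorts by the first column and uses the second to break ties. The countries "A", "B", and "C" below are made up so that the tie is easy to see.

```python
import pandas as pd

# Made-up rows: A and B tie on beer_servings, so wine_servings decides their order.
drinks = pd.DataFrame({
    "country": ["A", "B", "C"],
    "beer_servings": [100, 100, 20],
    "wine_servings": [50, 10, 300],
})

by_beer_then_wine = drinks.sort_values(["beer_servings", "wine_servings"])
print(by_beer_then_wine)
```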

## Renaming, Adding, and Removing Columns

#### Rename `beer_servings` as `beer` and `wine_servings` as `wine` in the `drinks` DataFrame, returning a new DataFrame.

#### Perform the same renaming for `drinks`, but in place.
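Both variants can be sketched with `DataFrame.rename` and a dictionary that maps old names to new names. The single-row DataFrame is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical stand-in for drinks.csv (values are illustrative only).
drinks = pd.DataFrame({
    "country": ["Albania"],
    "beer_servings": [89],
    "wine_servings": [54],
})
new_names = {"beer_servings": "beer", "wine_servings": "wine"}

# Returns a new DataFrame; drinks keeps its original column names.
renamed = drinks.rename(columns=new_names)
columns_after_copy = list(drinks.columns)

# Modifies drinks itself and returns None.
drinks.rename(columns=new_names, inplace=True)
```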

#### Replace the column names of drinks with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']`.

#### Replace the column names of drinks with `['country', 'beer', 'spirit', 'wine', 'liters', 'continent']` when you import the file.
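A sketch of both approaches. To keep it self-contained, an in-memory string stands in for `drinks.csv`; the header names in it (including `spirit_servings`) and the values are assumptions about the file.

```python
import io
import pandas as pd

# In-memory stand-in for drinks.csv (header and values are assumptions).
csv_text = (
    "country,beer_servings,spirit_servings,wine_servings,"
    "total_litres_of_pure_alcohol,continent\n"
    "Albania,89,132,54,4.9,EU\n"
    "Namibia,376,3,1,6.8,AF\n"
)
drink_cols = ["country", "beer", "spirit", "wine", "liters", "continent"]

# Option 1: replace all column names after loading.
drinks = pd.read_csv(io.StringIO(csv_text))
drinks.columns = drink_cols

# Option 2: rename on import. header=0 says the first row is a header to
# discard, and names supplies the replacement column names.
drinks_on_import = pd.read_csv(io.StringIO(csv_text), header=0, names=drink_cols)
```

With the real file, the first argument to `pd.read_csv` would be the path to `drinks.csv` instead of the `io.StringIO` object.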

#### Bonus Tip: What if we have a lot of columns where we want to replace spaces with underscores?
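One way to do this is to apply a string method to the whole column index at once. The column names below are hypothetical.

```python
import pandas as pd

# Empty DataFrame with hypothetical column names that contain spaces.
df = pd.DataFrame(columns=["first name", "last name", "age"])

# df.columns is an Index, and .str applies the replacement to every name.
df.columns = df.columns.str.replace(" ", "_")
print(list(df.columns))
```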

#### Make a `servings` column that combines `beer`, `spirit`, and `wine`.

#### Make an `mL` column that is the `liters` column multiplied by 1,000.
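Assigning to a column name that does not exist yet creates the column. A minimal sketch on made-up rows that already use the short column names:

```python
import pandas as pd

# Hypothetical stand-in for the renamed drinks data (values are illustrative only).
drinks = pd.DataFrame({
    "country": ["Albania", "Namibia"],
    "beer": [89, 376],
    "spirit": [132, 3],
    "wine": [54, 1],
    "liters": [5.0, 6.5],
})

# Column arithmetic is element-wise, one result per row.
drinks["servings"] = drinks["beer"] + drinks["spirit"] + drinks["wine"]
drinks["mL"] = drinks["liters"] * 1000
print(drinks)
```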

#### Remove the `mL` column, returning a new DataFrame.

#### Remove the `mL` and `servings` columns from drinks in place.

#### What if we want to remove rows instead of columns?
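`DataFrame.drop` covers all three cases: `axis=1` drops columns by name, and `axis=0` drops rows by index label. The data below is a made-up stand-in.

```python
import pandas as pd

# Hypothetical stand-in for drinks with the two extra columns already added.
drinks = pd.DataFrame({
    "country": ["Albania", "France", "Namibia"],
    "liters": [5.0, 12.0, 6.5],
    "servings": [275, 648, 380],
    "mL": [5000.0, 12000.0, 6500.0],
})

# Returns a new DataFrame without mL; drinks is unchanged.
without_ml = drinks.drop("mL", axis=1)

# Removes both columns from drinks itself.
drinks.drop(["mL", "servings"], axis=1, inplace=True)

# Rows are dropped by index label with axis=0.
without_first_two = drinks.drop([0, 1], axis=0)
```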

## Axis parameter

#### `axis=0` moves down the rows and collapses them into one mean per column

#### `axis=1` moves across the columns and collapses them into one mean per row (It helps me to think of the number 1 looking like an architectural column)

#### `axis` has aliases/nicknames that are a bit more intuitive
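A small numeric example makes the direction of each axis concrete. The column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({"beer": [10, 20, 30], "wine": [1, 2, 3]})

# axis=0: collapse the rows, giving one mean per column.
column_means = df.mean(axis=0)   # beer 20.0, wine 2.0

# axis=1: collapse the columns, giving one mean per row.
row_means = df.mean(axis=1)      # 5.5, 11.0, 16.5

# The aliases: "index" is axis 0 and "columns" is axis 1.
same_as_axis_0 = df.mean(axis="index")
same_as_axis_1 = df.mean(axis="columns")
```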

## Handling Missing Values

#### Create a dataframe of Booleans indicating which values are missing or not missing.

#### Find the number of missing values by column in `drinks`.

#### Drop rows where ANY values are missing in `drinks` (returning a new DataFrame).

#### Drop rows only where ALL values are missing in `drinks`.
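The four steps above can be sketched as follows. The rows are made up so that one row has a single missing value and one row is entirely missing.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 1 is missing its continent, row 2 is missing everything.
drinks = pd.DataFrame({
    "country": ["Albania", "Canada", None],
    "beer_servings": [89.0, 240.0, np.nan],
    "continent": ["EU", None, None],
})

# DataFrame of Booleans: True where a value is missing (notnull is the inverse).
missing = drinks.isnull()

# True counts as 1, so summing gives the number of missing values per column.
missing_per_column = drinks.isnull().sum()

# Drop rows with ANY missing value (returns a new DataFrame).
no_missing = drinks.dropna(how="any")

# Drop rows only where ALL values are missing.
not_all_missing = drinks.dropna(how="all")
```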

#### Filling in NaN Values. What's up with all of these NaN continents?

All of these countries are in North America. Their continent code is the string `NA`, which pandas treats as a missing-value marker by default, so it was read in as NaN.

#### Fill in the missing values of the `continent` column using string 'NA'.
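A sketch that reproduces the problem with a two-row in-memory CSV (a made-up stand-in for `drinks.csv`) and then fills the gap. It also shows one alternative, the `keep_default_na=False` option of `read_csv`, which stops pandas from treating `NA` as missing in the first place.

```python
import io
import pandas as pd

# In-memory stand-in for drinks.csv (values are illustrative only).
csv_text = "country,continent\nCanada,NA\nFrance,EU\n"

# By default, read_csv parses the string "NA" as a missing value.
drinks = pd.read_csv(io.StringIO(csv_text))
missing_before = drinks["continent"].isnull().sum()

# Fill the missing continents with the string 'NA' and assign the result back.
drinks["continent"] = drinks["continent"].fillna("NA")

# Alternative: turn off the default missing-value markers when reading.
# Note that this applies to every column, not just continent.
drinks_raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```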

# Independent Exercise

#### Using the UFO data ("ufo.csv")

1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most frequently reported colors.
4. Find the most frequent city for reports in state VA.
5. Find only UFO reports from Arlington, VA.
6. Find the number of missing values in each column.
7. Show only UFO reports where city is missing.
8. Count the number of rows with no null values.
9. Amend column names with spaces to have underscores.
10. Make a new column that is a combination of city and state.


**Bonus:** Drop rows where City or Shape Reported is missing.

We'll return to missing values when we talk about preprocessing!

## Split-Apply-Combine

#### Find the mean beer servings across the entire `drinks` dataset

#### But what if we wanted to look at beer servings by continent? This is where `.groupby()` is useful. It splits the rows by continent and then calculates the mean for each group.

Use a `.groupby()` whenever you want to analyze a dataset by some category. If you can phrase your question as "For each...", then it is a good candidate for a `.groupby()`. For example, "For each continent, what is the mean beer serving?"
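The "for each continent" question can be sketched like this, on made-up rows with two continents:

```python
import pandas as pd

# Hypothetical stand-in for drinks.csv (values are illustrative only).
drinks = pd.DataFrame({
    "continent": ["EU", "EU", "AF", "AF"],
    "beer_servings": [100, 200, 30, 50],
})

# Mean across the entire dataset.
overall_mean = drinks["beer_servings"].mean()

# Split by continent, apply mean to each group, combine into one Series.
mean_by_continent = drinks.groupby("continent")["beer_servings"].mean()
print(mean_by_continent)
```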

#### What happens if we don't specify a column? Let's find the max of all the columns

#### Using the `.agg` function we can specify multiple functions at once for our `.groupby()`
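A sketch of both ideas on made-up rows: leaving out the column selection aggregates every remaining column, and `.agg` accepts a list of function names.

```python
import pandas as pd

# Hypothetical stand-in for drinks.csv (values are illustrative only).
drinks = pd.DataFrame({
    "continent": ["EU", "EU", "AF", "AF"],
    "beer_servings": [100, 200, 30, 50],
    "wine_servings": [300, 10, 1, 5],
})

# No column selected: max is computed for every non-grouping column.
max_by_continent = drinks.groupby("continent").max()

# Several aggregations at once, one output column per function.
beer_stats = drinks.groupby("continent")["beer_servings"].agg(
    ["count", "mean", "min", "max"]
)
print(beer_stats)
```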

## String methods

#### You can use Python's string methods with pandas by adding `.str` before the name of the string method. Remember that some pandas `.str` methods, such as `.str.contains`, treat their pattern as a regular expression by default.
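A short sketch of the `.str` accessor on a made-up `country` column, with one plain method and two pattern searches:

```python
import pandas as pd

# Hypothetical stand-in for the country column (values are illustrative only).
drinks = pd.DataFrame({"country": ["Albania", "France", "Andorra"]})

# Same as Python's str.upper, applied to every value in the Series.
upper = drinks["country"].str.upper()

# Substring search; the match is case-sensitive.
has_an = drinks["country"].str.contains("an")

# The pattern is a regular expression by default: ^ anchors to the start.
starts_with_a = drinks["country"].str.contains("^A")

# The Boolean Series can filter the DataFrame like any other mask.
a_countries = drinks[starts_with_a]
```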