# **Filling missing values with pandas**

- When transforming data, you'll often come across missing values in a DataFrame, which are typically represented as NaN.
- In the DataFrame below, the second row is missing its "open" value and the third row is missing its "close" value.
- To remedy this, pandas offers the `.fillna()` method.
- In its most basic form, this method takes a single value and uses it to fill every NaN in the DataFrame.
- In our example, the missing values in the "open" and "close" columns are filled with zero, and the result is shown below.

In [None]:
timestamps	        volume	    open	    close
1997-05-15 13:30:00	1443120000	0.121875	0.097917
1997-05-16 13:30:00	294000000	NaN	        0.086458
1997-05-19 13:30:00	122136000	0.088021	NaN

In [None]:
# Fill all NaN with value 0
clean_stock_data = raw_stock_data.fillna(value=0)

In [None]:
timestamps	        volume	    open	    close
1997-05-15 13:30:00	1443120000	0.121875	0.097917
1997-05-16 13:30:00	294000000	0.000000	0.086458
1997-05-19 13:30:00	122136000	0.088021	0.000000
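
The fill above can be reproduced end to end with a minimal sketch like the one below. The DataFrame is rebuilt by hand from the three rows shown, so the construction code is an assumption and not how `raw_stock_data` was originally loaded.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the three sample rows (assumed construction; the notes
# do not show how raw_stock_data is loaded).
raw_stock_data = pd.DataFrame(
    {
        "volume": [1443120000, 294000000, 122136000],
        "open": [0.121875, np.nan, 0.088021],
        "close": [0.097917, 0.086458, np.nan],
    },
    index=pd.to_datetime(
        ["1997-05-15 13:30:00", "1997-05-16 13:30:00", "1997-05-19 13:30:00"]
    ),
)
raw_stock_data.index.name = "timestamps"

# Count missing values per column before filling
missing_before = raw_stock_data.isna().sum()

# fillna returns a new DataFrame; raw_stock_data itself is left unchanged
clean_stock_data = raw_stock_data.fillna(value=0)
missing_after = clean_stock_data.isna().sum()
```

Checking `isna().sum()` before and after is a quick way to confirm that the fill touched only the cells that were missing.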

- A dictionary can also be passed to the `value` parameter of the `.fillna()` method.
- The keys are column names, and each associated value is used to fill missing values in that column. No `axis` argument is needed, because pandas fills dictionary values column by column; passing `axis=1` together with a dictionary raises an error.
- This expedites filling missing values across multiple columns.
- In our example, all missing values in the "open" column are replaced with zero, and all missing values in the "close" column are replaced with 0.5.

In [None]:
timestamps	        volume	    open	    close
1997-05-15 13:30:00	1443120000	0.121875	0.097917
1997-05-16 13:30:00	294000000	NaN	        0.086458
1997-05-19 13:30:00	122136000	0.088021	NaN

In [None]:
# Fill NaN values with a specific value for each column (keys are column names)
clean_stock_data = raw_stock_data.fillna(value={"open": 0, "close": 0.5})

In [None]:
timestamps	        volume	    open	    close
1997-05-15 13:30:00	1443120000	0.121875	0.097917
1997-05-16 13:30:00	294000000	0.000000	0.086458
1997-05-19 13:30:00	122136000	0.088021	0.500000
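
As a small sketch of how the dictionary form behaves, the block below fills both columns and then only one of them. The hand-built DataFrame and the names `fill_values` and `partially_filled` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample rows (assumed construction)
raw_stock_data = pd.DataFrame(
    {
        "volume": [1443120000, 294000000, 122136000],
        "open": [0.121875, np.nan, 0.088021],
        "close": [0.097917, 0.086458, np.nan],
    }
)

# Keys are column names; each value fills the NaN cells of that column only
fill_values = {"open": 0, "close": 0.5}
clean_stock_data = raw_stock_data.fillna(value=fill_values)

# Columns that are not listed in the dictionary are left untouched,
# so the missing "close" value survives here
partially_filled = raw_stock_data.fillna(value={"open": 0})
```

Listing only the columns you want to fill is a simple way to avoid overwriting missing values that should be handled differently later.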

- A column can also be passed to the `.fillna()` method.
- When that occurs, each missing value is replaced with the value at the same row label in the column that was passed.
- Here, the "close" column is used to fill all missing values in the "open" column.
- When the `inplace` parameter is set to True, the data is altered in place, and the output does not need to be stored in a new variable. Recent pandas versions may warn when `inplace=True` is used on a selected column like this, so assigning the result back to the column is the more portable form.

In [None]:
timestamps	        volume	    open	    close
1997-05-15 13:30:00	1443120000	0.121875	0.097917
1997-05-16 13:30:00	294000000	NaN	        0.086458
1997-05-19 13:30:00	122136000	0.088021	NaN

In [None]:
# Fill NaN value using other columns
raw_stock_data["open"].fillna(raw_stock_data["close"], inplace=True)

In [None]:
timestamps	        volume	    open	    close
1997-05-15 13:30:00	1443120000	0.121875	0.097917
1997-05-16 13:30:00	294000000	0.086458	0.086458
1997-05-19 13:30:00	122136000	0.088021	NaN
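
The same column-based fill can be sketched with plain assignment instead of `inplace=True`. This is an alternative form under the assumption of a hand-built copy of the sample rows, not the code used in the notes.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the sample rows (assumed construction)
raw_stock_data = pd.DataFrame(
    {
        "volume": [1443120000, 294000000, 122136000],
        "open": [0.121875, np.nan, 0.088021],
        "close": [0.097917, 0.086458, np.nan],
    }
)

# Fill missing "open" values from "close"; values are matched by row label,
# and the filled column is assigned back to the DataFrame explicitly
raw_stock_data["open"] = raw_stock_data["open"].fillna(raw_stock_data["close"])
```

Only "open" is filled. The missing "close" value in the third row stays NaN, matching the output table above.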

# **Grouping data**

- In SQL, one of the most common transformations applied to data is grouping with `GROUP BY`.
- In the SQL statement below, data is grouped by the "ticker" column, and the average of each remaining column is taken.
- pandas offers the same functionality through its `.groupby()` method.

In [None]:
SELECT
    ticker,
    AVG(volume),
    AVG(open),
    AVG(close)
FROM
    raw_stock_data
GROUP BY
    ticker;

- The `.groupby()` method can recreate the query above in pandas.

# **Grouping data with pandas**

- In a single line of code, the `.groupby()` method groups the raw_stock_data DataFrame by the "ticker" column and finds the mean of the other columns.
- Passing zero to the `axis` parameter splits the DataFrame along its rows, which is the default and the usual choice.
- If one is passed to `axis`, the data is split along its columns instead. Newer pandas releases deprecate the `axis` argument of `.groupby()`, so omitting it is the more portable form.
- The grouped result is stored in the grouped_stock_data DataFrame and shown below.
- In addition to the mean, pandas allows methods such as `.min()`, `.max()`, and `.sum()` to be used to aggregate the remaining columns.

In [None]:
ticker  volume     open     close
AAPL    1443120000 0.121875 0.097917
AAPL    297000000  0.098146 0.086458
AMZN    124186000  0.247511 0.251290

In [None]:
# Use Python to group data by ticker, find the mean of the remaining columns
grouped_stock_data = raw_stock_data.groupby(by=["ticker"], axis=0).mean()

In [None]:
        volume	        open	    close
ticker
AAPL	1.149287e+08	34.998377	34.986851
AMZN	1.434213e+08	30.844692	30.830233

- `.min()`, `.max()`, and `.sum()` can also be used to aggregate the grouped data
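
To make the aggregation options concrete, here is a minimal sketch built from only the three sample rows shown earlier, so its numbers differ from the grouped output above. The output column names `total_volume`, `lowest_open`, and `highest_close` are illustrative assumptions.

```python
import pandas as pd

# The three sample rows, rebuilt by hand (assumed construction)
raw_stock_data = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "AMZN"],
        "volume": [1443120000, 297000000, 124186000],
        "open": [0.121875, 0.098146, 0.247511],
        "close": [0.097917, 0.086458, 0.251290],
    }
)

# Mean of every remaining column per ticker, mirroring the SQL AVG query
mean_by_ticker = raw_stock_data.groupby("ticker").mean()

# Named aggregation: a different summary for each column in a single call
summary_by_ticker = raw_stock_data.groupby("ticker").agg(
    total_volume=("volume", "sum"),
    lowest_open=("open", "min"),
    highest_close=("close", "max"),
)
```

In both results the "ticker" values become the index, just as in the grouped output above.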

# **Applying advanced transformations to DataFrames**

- At times, transformation logic will be more complex than what pandas' built-in functionality can handle.
- For these cases, pandas offers the `.apply()` method, which takes a function containing the custom transformation logic and applies it to the DataFrame.
- To illustrate this, let's use an example. We'd like to classify the price change that takes place for an asset by creating a "change" column.
- First, we'll define a function called classify_change.
- This function takes a row and returns "Increase" when "close" is greater than "open", and "Decrease" otherwise.
- Once this function is defined, it is applied to each row using the `.apply()` method.
- Setting `axis` equal to one ensures that the classify_change function receives one row at a time.
- The result is written to the "change" column and shown in the DataFrame below.

The `.apply()` method can handle more advanced transformations

In [None]:
Before transformation

ticker ...  open      close
AAPL        0.121875  0.097917
AAPL        0.098146  0.086458
AMZN        0.247511  0.251290

In [None]:
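def classify_change(row):
    # Positive difference means the asset closed above its opening price
    change = row["close"] - row["open"]
    if change > 0:
        return "Increase"
    else:
        # Note: an unchanged price (change == 0) is also labelled "Decrease"
        return "Decrease"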

In [None]:
# Apply transformation to DataFrame
raw_stock_data["change"] = raw_stock_data.apply(
    classify_change,
    axis=1
)

In [None]:
After transformation

ticker  ... open      close     change
AAPL        0.121875  0.097917  Decrease
AAPL        0.098146  0.086458  Decrease
AMZN        0.247511  0.251290  Increase
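
The full example can be run as the sketch below, which also shows a vectorized way to get the same labels with `numpy.where`. The hand-built DataFrame and the `change_vectorized` column name are assumptions for illustration; the vectorized form is offered as an optional alternative for simple comparisons, not as a replacement for `.apply()` when the logic is more involved.

```python
import numpy as np
import pandas as pd

# The three sample rows, rebuilt by hand (assumed construction)
raw_stock_data = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "AMZN"],
        "open": [0.121875, 0.098146, 0.247511],
        "close": [0.097917, 0.086458, 0.251290],
    }
)


def classify_change(row):
    """Label a row by whether the asset closed above its opening price."""
    change = row["close"] - row["open"]
    return "Increase" if change > 0 else "Decrease"


# Row-wise application: axis=1 passes each row to classify_change
raw_stock_data["change"] = raw_stock_data.apply(classify_change, axis=1)

# Vectorized equivalent for this particular rule (hypothetical column name)
raw_stock_data["change_vectorized"] = np.where(
    raw_stock_data["close"] > raw_stock_data["open"], "Increase", "Decrease"
)
```

Both columns hold the same labels here. The row-wise version is easier to extend with custom logic, while the vectorized version avoids calling a Python function once per row.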