# Creating and Visualizing DataFrames
Learn to visualize the contents of your DataFrames, handle missing data values, and import data from and export data to CSV files.

#### Import libraries

In [None]:
# Import pandas with alias pd
import pandas as pd

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt
%matplotlib inline

#### Import data and other dependencies

In [None]:
plt.rcParams["figure.dpi"] = 150

In [None]:
avocados = pd.read_csv('../../data/avocado.csv')

***
<br>

# Visualizing your data

# Which avocado size is most popular?
Avocados are increasingly popular and delicious in guacamole and on toast. The Hass Avocado Board keeps track of avocado supply and demand across the USA, including the sales of three different sizes of avocado. In this section, you'll use a bar plot to figure out which size is the most popular.

Bar plots are great for revealing relationships between categorical (size) and numeric (number sold) variables, but you'll often have to manipulate your data first in order to get the numbers you need for plotting.

`pandas` has been imported, and `avocados` is available.
<br>
<br>
##### Instructions
- Print the head of the `avocados` dataset. **What columns are available?**
- For each avocado size group, calculate the total number sold, storing as `nb_sold_by_size`.
- Create a bar plot of the number of avocados sold by size.
- Show the plot.

In [None]:
# Look at the first few rows of data
print(avocados.head())

In [None]:
# Get the total number of avocados sold of each size
nb_sold_by_size = avocados.groupby('size')['nb_sold'].sum()

# Create a bar plot of the number of avocados sold by size
nb_sold_by_size.plot(kind='bar')

# Show the plot
plt.show()
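
The aggregation step is easy to check on a small hand-made table before trusting the plot. The sketch below uses hypothetical toy data (the `toy` DataFrame and its numbers are invented for illustration, not taken from `avocado.csv`) and finds the most popular size without drawing anything.

```python
import pandas as pd

# Hypothetical toy data, NOT the real avocado file
toy = pd.DataFrame({
    "size": ["small", "large", "small", "extra_large", "large", "small"],
    "nb_sold": [100, 80, 150, 10, 70, 50],
})

# Sum the number sold within each size group
toy_sold_by_size = toy.groupby("size")["nb_sold"].sum()

# Sort so the largest total comes first
toy_sold_by_size = toy_sold_by_size.sort_values(ascending=False)
print(toy_sold_by_size)

# idxmax returns the index label of the largest total
most_popular = toy_sold_by_size.idxmax()
print(most_popular)
```

Sorting before calling `.plot(kind='bar')` also orders the bars from tallest to shortest, which can make the comparison easier to read.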

# Changes in sales over time
Line plots are designed to visualize the relationship between two numeric variables, where each data value is connected to the next one. They are especially useful for visualizing the change in a number over time since each time point is naturally connected to the next time point. In this exercise, you'll visualize the change in avocado sales over three years.
<br>
<br>
##### Instructions
- Get the total number of avocados sold on each date. _The DataFrame has two rows for each date—one for organic, and one for conventional_. Save this as `nb_sold_by_date`.
- Create a line plot of the number of avocados sold.
- Show the plot.

In [None]:
# Get the total number of avocados sold on each date
nb_sold_by_date = avocados.groupby('date')['nb_sold'].sum()

# Create a line plot of the number of avocados sold by date
nb_sold_by_date.plot(kind='line')

# Show the plot
plt.show()
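
Because `pd.read_csv` was called without date parsing, the `date` column is probably stored as text. A minimal sketch, assuming ISO-style strings such as `"2015-01-04"` and using a hypothetical `toy` DataFrame, shows how converting to real datetimes before grouping gives a time-ordered index that a line plot can treat as a proper time axis.

```python
import pandas as pd

# Hypothetical toy data with dates stored as strings, deliberately unordered
toy = pd.DataFrame({
    "date": ["2015-01-11", "2015-01-04", "2015-01-04", "2015-01-11"],
    "nb_sold": [30, 10, 20, 40],
})

# Convert the text column to datetime64 values
toy["date"] = pd.to_datetime(toy["date"])

# Total sold per date; groupby sorts the group keys chronologically
toy_sold_by_date = toy.groupby("date")["nb_sold"].sum()
print(toy_sold_by_date)
```

Calling `.plot(kind='line')` on the result would then place the points according to their actual dates.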

# Avocado supply and demand
Scatter plots are ideal for visualizing relationships between numerical variables. In this section, you'll compare the number of avocados sold to average price and see if they're at all related. If they're related, you may be able to use one number to predict the other.
<br>
<br>
##### Instructions
- Create a scatter plot with `nb_sold` on the x-axis and `avg_price` on the y-axis. Title it `Number of avocados sold vs. average price`.
- Show the plot.

In [None]:
# Scatter plot of avg_price vs. nb_sold with title
avocados.plot(x='nb_sold', y='avg_price', kind='scatter', title='Number of avocados sold vs. average price')

# Show the plot
plt.show()

It looks like when more avocados are sold, prices go down. However, this doesn't mean that fewer sales _cause_ higher prices. We can only tell that the two variables are correlated with each other.
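
One way to put a number on that visual impression is the correlation coefficient. The sketch below uses invented toy values (not the real dataset) and `Series.corr`, which computes the Pearson correlation by default. A negative result means the two columns tend to move in opposite directions, but it says nothing about cause.

```python
import pandas as pd

# Hypothetical toy data: price falls as the number sold rises
toy = pd.DataFrame({
    "nb_sold": [100, 200, 300, 400, 500],
    "avg_price": [1.9, 1.6, 1.5, 1.2, 1.0],
})

# Pearson correlation between the two columns (always between -1 and 1)
r = toy["nb_sold"].corr(toy["avg_price"])
print(round(r, 2))
```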

# Price of conventional vs. organic avocados
Creating multiple plots for different subsets of data allows you to compare groups. In this section, you'll create multiple histograms to compare the prices of conventional and organic avocados.
<br>
<br>
##### Instructions 1/3
- Subset `avocados` for the conventional type and the `avg_price` column, then create a histogram.
- Create a histogram of `avg_price` for organic type avocados.
- Add a legend to your plot, with the names "conventional" and "organic".
- Show your plot.



In [None]:
# Histogram of conventional avg_price
avocados[avocados["type"] == "conventional"]["avg_price"].hist()

# Histogram of organic avg_price
avocados[avocados["type"] == "organic"]["avg_price"].hist()

# Add a legend
plt.legend(["conventional", "organic"])

# Show the plot
plt.show()

##### Instructions 2/3
- Modify your code to adjust the transparency of both histograms to `0.5` to see how much overlap there is between the two distributions.
    - Hint: Use the `alpha` argument to adjust plot transparency.

In [None]:
# Modify histogram transparency to 0.5
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5)

# Modify histogram transparency to 0.5
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5)

# Add a legend and show plot
plt.legend(["conventional", "organic"])
plt.show()

##### Instructions 3/3
- Modify your code to use 20 bins in both histograms.

In [None]:
# Modify bins to 20
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)

# Modify bins to 20
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)

# Add a legend and show plot
plt.legend(["conventional", "organic"])
plt.show()

We can see that on average, organic avocados are more expensive than conventional ones, but their price distributions have some overlap.
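
The same comparison can be made numerically by summarizing the price per type. This is a minimal sketch on hypothetical toy prices, not the real data: it compares the group means and checks whether the two price ranges overlap.

```python
import pandas as pd

# Hypothetical toy prices for the two avocado types
toy = pd.DataFrame({
    "type": ["conventional"] * 4 + ["organic"] * 4,
    "avg_price": [0.9, 1.1, 1.2, 1.4, 1.3, 1.6, 1.8, 2.0],
})

# Minimum, mean and maximum price within each type
price_summary = toy.groupby("type")["avg_price"].agg(["min", "mean", "max"])
print(price_summary)

# The ranges overlap if the priciest conventional avocado costs more
# than the cheapest organic one
ranges_overlap = (
    price_summary.loc["conventional", "max"] > price_summary.loc["organic", "min"]
)
print(ranges_overlap)
```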

***
<br>

# Missing values

# Finding missing values
Missing values are everywhere, and you don't want them interfering with your work. Some functions ignore missing data by default, but that's not always the behavior you might want. Some functions can't handle missing values at all, so these values need to be taken care of before you can use them. If you don't know where your missing values are, or if they exist, you could make mistakes in your analysis. In this exercise, you'll determine if there are missing values in the dataset, and if so, how many.

`avocados_2016`, a subset of `avocados` that contains only sales from 2016, is not available in this notebook, so the cells below use `avocados` instead.
<br>
<br>
##### Instructions
- Print a DataFrame that shows whether each value in `avocados_2016` is missing or not.
- Print a summary that shows whether _any_ value in each column is missing or not.
- Create a bar plot of the total number of missing values in each column.
<br>

##### Hint
- Use the `.isna()` method to check each individual value for missingness.
- The `.any()` method will take a DataFrame of Booleans and flatten it to indicate if there are _any_ `True` values in each column.
- The `.sum()` method can be used on a DataFrame of Booleans to count the number of `True` values in each column.
- Call `.plot()`, setting `kind` to `bar` to draw a bar plot.

In [None]:
# Check individual values for missing values
print(avocados.isna())

In [None]:
# Check each column for missing values
print(avocados.isna().any())

In [None]:
# Bar plot of missing values by variable
avocados.isna().sum().plot(kind="bar")

# Show plot
plt.show()
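
To see what each of these calls returns, it can help to run them on a table where the missing values are known in advance. The sketch below builds a hypothetical `toy` DataFrame (the values are invented; `small_sold` and `large_sold` are borrowed only as column names) and chains `.isna()` with `.any()` and `.sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few deliberately missing values
toy = pd.DataFrame({
    "avg_price": [1.2, 1.5, 1.1, 1.4],
    "small_sold": [100.0, np.nan, 300.0, np.nan],
    "large_sold": [50.0, 60.0, np.nan, 80.0],
})

# Boolean DataFrame: True wherever a value is missing
missing_mask = toy.isna()

# One Boolean per column: does it contain at least one missing value?
any_missing = missing_mask.any()
print(any_missing)

# One count per column: True counts as 1, so the sum is the number missing
n_missing = missing_mask.sum()
print(n_missing)
```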

# Removing missing values
Now that you know there are some missing values in your DataFrame, you have a few options to deal with them. One way is to remove them from the dataset completely. In this section, you'll do that by dropping every row that contains a missing value.

`avocados_2016` is NOT available, but `avocados` is.
<br>
<br>
##### Instructions
- Remove the rows of `avocados_2016` that contain missing values and store the remaining rows in `avocados_complete`.
- Verify that all missing values have been removed from `avocados_complete`. Check whether each column contains any NAs and print the result.
<br>

##### Hint
- Call the method that ***drops*** missing (NA) values.
- Use the `.any()` method in combination with another to check columns for missing values.

In [None]:
# # Remove rows with missing values
# avocados_complete = avocados_2016.dropna()
#
# # Check if any columns contain missing values
# print(avocados_complete.isna().any())
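
Since `avocados_2016` is not loaded here, the following sketch shows the same idea on a hypothetical `toy` DataFrame with invented values. `DataFrame.dropna()` returns a new DataFrame without the incomplete rows and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two columns
toy = pd.DataFrame({
    "avg_price": [1.2, 1.5, 1.1, 1.4],
    "small_sold": [100.0, np.nan, 300.0, np.nan],
    "large_sold": [50.0, 60.0, np.nan, 80.0],
})

# Drop every row that has at least one missing value
toy_complete = toy.dropna()

# Confirm that no column contains missing values any more
print(toy_complete.isna().any())

# Compare row counts to see how much data the drop cost
print(len(toy), "rows before,", len(toy_complete), "rows after")
```

In this toy table only one of the four rows survives, which illustrates why dropping rows can be costly when missing values are spread across many rows.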

# Replacing missing values
Another way of handling missing values is to replace them all with the same value. For numerical variables, one option is to replace missing values with 0, which is what you'll do here. However, when you replace missing values, you make assumptions about what a missing value means. In this case, you will assume that a missing number sold means that no sales for that avocado type were made that week.

In this section, you'll see how replacing missing values can affect the distribution of a variable using histograms. You can plot histograms for multiple variables at a time as follows: `dogs[["height_cm", "weight_kg"]].hist()`

Again, the `avocados_2016` dataset is NOT available.
<br>
<br>
##### Instructions 1/2
- A list has been created, `cols_with_missing`, containing the names of columns with missing values: `small_sold`, `large_sold`, and `xl_sold`.
- Create a histogram of those columns.
- Show the plot.

In [None]:
# # List the columns with missing values
# cols_with_missing = ["small_sold", "large_sold", "xl_sold"]
#
# # Create histograms showing the distributions of cols_with_missing
# avocados_2016[cols_with_missing].hist()
#
# # Show the plot
# plt.show()

##### Instructions 2/2
- Replace the missing values of `avocados_2016` with `0`s and store the result as `avocados_filled`.
- Create a histogram of the `cols_with_missing` columns of `avocados_filled`.

In [None]:
# # Fill in missing values with 0
# avocados_filled = avocados_2016.fillna(0)
#
# # Create histograms of the filled columns
# avocados_filled[cols_with_missing].hist()
#
# # Show the plot
# plt.show()

Notice how the distribution has changed shape after replacing missing values with zeros.
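
The shift can also be seen in summary statistics. As a minimal sketch on a hypothetical one-column `toy` DataFrame (values invented for illustration), compare the mean before and after filling. `Series.mean()` skips missing values by default, while the filled zeros are counted like any other observation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values
toy = pd.DataFrame({"small_sold": [100.0, np.nan, 300.0, np.nan]})

# mean() ignores NaN by default: (100 + 300) / 2 = 200.0
mean_ignoring_missing = toy["small_sold"].mean()

# Replace every missing value with 0
toy_filled = toy.fillna(0)

# The zeros now count as observations: (100 + 0 + 300 + 0) / 4 = 100.0
mean_after_fill = toy_filled["small_sold"].mean()

print(mean_ignoring_missing, mean_after_fill)
```

Filling with 0 pulls the mean down and adds a spike at zero to the histogram, so it is only appropriate when a missing value really does mean "nothing sold".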